Hi Patrik,

Thanks a lot for your feedback.
I can confirm we are not doing any inter-cluster watch.

The issue we are facing is described in one of my previous messages:
https://groups.google.com/forum/#!topic/akka-user/XcYDI2znD1s

Although the "asymmetric" network failure might initially propagate wrong
information through the gossip, I'd expect the failure detector to trigger
correctly and work out that the remaining nodes are actually all reachable.
In the example reported, after cutting the connection between host A and
host B, I'd expect ACT1 on host A to report to ACT7 on host C that ACT4 on
host B is unreachable, and so on. BUT I'd also expect the failure detector
to trigger, based on received heartbeats, and tell all the actors on host C
that all the actors on host B are reachable, since what they "saw" before
was a transient state.
The failing scenario described might have happened because the observers of
the specific node on host B were probably all on host A, or because the
gossip has been "poisoned" for other reasons?
Not being able to tag as reachable a node that is actually reachable is an
issue.

Would be great if we could get some clarification on this.

Note 1: this links to the issue (https://github.com/akka/akka/issues/22090),
where we might be able to understand who is observing whom and why the
cluster is not able to recover from the initial transient period.
Note 2: we have already tuned
akka.cluster.failure-detector.acceptable-heartbeat-pause and
akka.cluster.failure-detector.threshold.

Cheers,
Francesco

On 3 January 2017 at 12:31, Patrik Nordwall <[email protected]>
wrote:

>
>
> On Mon, Jan 2, 2017 at 10:59 PM, 'Francesco laTorre' via Akka User List <
> [email protected]> wrote:
>
>> Hi there,
>>
>> Does anyone have a clue about this?
>> It would be great if we could get some help clarifying this aspect.
>>
>> Cheers,
>> Francesco
>>
>> On 29 December 2016 at 17:12, Francesco laTorre <
>> [email protected]> wrote:
>>
>>> Hi hAkkers,
>>>
>>> From the generic configuration :
>>> http://doc.akka.io/docs/akka/current/general/configuration.html
>>>
>>> I don't really get the differences between the
>>> akka.cluster.failure-detector and akka.remote.*-failure-detector :
>>>
>>> akka {
>>>
>>>   [...]
>>>
>>>   cluster {
>>>
>>>     # Settings for the Phi accrual failure detector (http://www.jaist.ac.jp/~defago/files/pdf/IS_RR_2004_010.pdf
>>>     # [Hayashibara et al]) used by the cluster subsystem to detect unreachable
>>>     # members.
>>>     # The default PhiAccrualFailureDetector will trigger if there are no heartbeats within
>>>     # the duration heartbeat-interval + acceptable-heartbeat-pause + threshold_adjustment,
>>>     # i.e. around 5.5 seconds with default settings.
>>>     failure-detector {
>>>
>>>       # FQCN of the failure detector implementation.
>>>       # It must implement akka.remote.FailureDetector and have
>>>       # a public constructor with a com.typesafe.config.Config and
>>>       # akka.actor.EventStream parameter.
>>>       implementation-class = "akka.remote.PhiAccrualFailureDetector"
>>>
>>>       # How often keep-alive heartbeat messages should be sent to each connection.
>>>       heartbeat-interval = 1 s
>>>
>>>       # Defines the failure detector threshold.
>>>       # A low threshold is prone to generate many wrong suspicions but ensures
>>>       # a quick detection in the event of a real crash. Conversely, a high
>>>       # threshold generates fewer mistakes but needs more time to detect
>>>       # actual crashes.
>>>       threshold = 8.0
>>>
>>>       # Number of the samples of inter-heartbeat arrival times to adaptively
>>>       # calculate the failure timeout for connections.
>>>       max-sample-size = 1000
>>>
>>>       # Minimum standard deviation to use for the normal distribution in
>>>       # AccrualFailureDetector. Too low standard deviation might result in
>>>       # too much sensitivity for sudden, but normal, deviations in heartbeat
>>>       # inter arrival times.
>>>       min-std-deviation = 100 ms
>>>
>>>       # Number of potentially lost/delayed heartbeats that will be
>>>       # accepted before considering it to be an anomaly.
>>>       # This margin is important to be able to survive sudden, occasional,
>>>       # pauses in heartbeat arrivals, due to for example garbage collect or
>>>       # network drop.
>>>       acceptable-heartbeat-pause = 3 s
>>>
>>>       # Number of member nodes that each member will send heartbeat messages to,
>>>       # i.e. each node will be monitored by this number of other nodes.
>>>       monitored-by-nr-of-members = 5
>>>
>>>       # After the heartbeat request has been sent the first failure detection
>>>       # will start after this period, even though no heartbeat message has
>>>       # been received.
>>>       expected-response-after = 1 s
>>>
>>>     }
>>>
>>>   }
>>>
>>>   [...]
>>>
>>> }
>>>
>>> and
>>>
>>> akka {
>>>
>>>   [...]
>>>
>>>   remote {
>>>
>>>     ### Settings shared by classic remoting and Artery (the new implementation of remoting)
>>>
>>>     # If set to a nonempty string remoting will use the given dispatcher for
>>>     # its internal actors otherwise the default dispatcher is used. Please note
>>>     # that since remoting can load arbitrary 3rd party drivers (see
>>>     # "enabled-transport" and "adapters" entries) it is not guaranteed that
>>>     # every module will respect this setting.
>>>     use-dispatcher = "akka.remote.default-remote-dispatcher"
>>>
>>>     # Settings for the failure detector to monitor connections.
>>>     # For TCP it is not important to have fast failure detection, since
>>>     # most connection failures are captured by TCP itself.
>>>     # The default DeadlineFailureDetector will trigger if there are no heartbeats within
>>>     # the duration heartbeat-interval + acceptable-heartbeat-pause, i.e. 20 seconds
>>>     # with the default settings.
>>>     transport-failure-detector {
>>>
>>>       # FQCN of the failure detector implementation.
>>>       # It must implement akka.remote.FailureDetector and have
>>>       # a public constructor with a com.typesafe.config.Config and
>>>       # akka.actor.EventStream parameter.
>>>       implementation-class = "akka.remote.DeadlineFailureDetector"
>>>
>>>       # How often keep-alive heartbeat messages should be sent to each connection.
>>>       heartbeat-interval = 4 s
>>>
>>>       # Number of potentially lost/delayed heartbeats that will be
>>>       # accepted before considering it to be an anomaly.
>>>       # A margin to the `heartbeat-interval` is important to be able to survive sudden,
>>>       # occasional, pauses in heartbeat arrivals, due to for example garbage collect or
>>>       # network drop.
>>>       acceptable-heartbeat-pause = 16 s
>>>     }
>>>
>>>     # Settings for the Phi accrual failure detector (http://www.jaist.ac.jp/~defago/files/pdf/IS_RR_2004_010.pdf
>>>     # [Hayashibara et al]) used for remote death watch.
>>>     # The default PhiAccrualFailureDetector will trigger if there are no heartbeats within
>>>     # the duration heartbeat-interval + acceptable-heartbeat-pause + threshold_adjustment,
>>>     # i.e. around 12.5 seconds with default settings.
>>>     watch-failure-detector {
>>>
>>>       # FQCN of the failure detector implementation.
>>>       # It must implement akka.remote.FailureDetector and have
>>>       # a public constructor with a com.typesafe.config.Config and
>>>       # akka.actor.EventStream parameter.
>>>       implementation-class = "akka.remote.PhiAccrualFailureDetector"
>>>
>>>       # How often keep-alive heartbeat messages should be sent to each connection.
>>>       heartbeat-interval = 1 s
>>>
>>>       # Defines the failure detector threshold.
>>>       # A low threshold is prone to generate many wrong suspicions but ensures
>>>       # a quick detection in the event of a real crash. Conversely, a high
>>>       # threshold generates fewer mistakes but needs more time to detect
>>>       # actual crashes.
>>>       threshold = 10.0
>>>
>>>       # Number of the samples of inter-heartbeat arrival times to adaptively
>>>       # calculate the failure timeout for connections.
>>>       max-sample-size = 200
>>>
>>>       # Minimum standard deviation to use for the normal distribution in
>>>       # AccrualFailureDetector. Too low standard deviation might result in
>>>       # too much sensitivity for sudden, but normal, deviations in heartbeat
>>>       # inter arrival times.
>>>       min-std-deviation = 100 ms
>>>
>>>       # Number of potentially lost/delayed heartbeats that will be
>>>       # accepted before considering it to be an anomaly.
>>>       # This margin is important to be able to survive sudden, occasional,
>>>       # pauses in heartbeat arrivals, due to for example garbage collect or
>>>       # network drop.
>>>       acceptable-heartbeat-pause = 10 s
>>>
>>>       # How often to check for nodes marked as unreachable by the failure
>>>       # detector
>>>       unreachable-nodes-reaper-interval = 1s
>>>
>>>       # After the heartbeat request has been sent the first failure detection
>>>       # will start after this period, even though no heartbeat message has
>>>       # been received.
>>>       expected-response-after = 1 s
>>>
>>>     }
>>>
>>>     [...]
>>>
>>>   }
>>> }
>>>
>>> So they are all based on heartbeats and trigger when values jump
>>> above thresholds.
>>> Akka Cluster is built on top of Akka Remote, but the configuration
>>> raises some ambiguities:
>>>
>>>    - default settings for akka.cluster.failure-detector will trigger
>>>    *PhiAccrualFailureDetector* if there are no heartbeats within *5.5s*
>>>    - default settings for akka.remote.watch-failure-detector will
>>>    trigger *PhiAccrualFailureDetector* if there are no heartbeats within
>>>    *12.5s*
>>>
>>> moreover
>>>
>>>    - akka.cluster.failure-detector is used by the cluster subsystem to
>>>    detect unreachable members.
>>>    - akka.remote.watch-failure-detector is used for remote death watch.
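>>>
>>> For reference, the timeouts quoted in the comments above can be read off
>>> the defaults. The fragment below is only a condensed view of the timing
>>> keys already shown; the threshold adjustment of roughly 1.5 s is inferred
>>> by subtracting the sums from the documented totals and is not a
>>> configurable value:
>>>
>>> ```hocon
>>> akka.cluster.failure-detector {
>>>   heartbeat-interval = 1 s
>>>   acceptable-heartbeat-pause = 3 s
>>>   # 1 s + 3 s = 4 s; the documented trigger is around 5.5 s,
>>>   # so threshold_adjustment is roughly 1.5 s (threshold = 8.0)
>>> }
>>>
>>> akka.remote.watch-failure-detector {
>>>   heartbeat-interval = 1 s
>>>   acceptable-heartbeat-pause = 10 s
>>>   # 1 s + 10 s = 11 s; the documented trigger is around 12.5 s,
>>>   # so threshold_adjustment is roughly 1.5 s (threshold = 10.0)
>>> }
>>>
>>> akka.remote.transport-failure-detector {
>>>   heartbeat-interval = 4 s
>>>   acceptable-heartbeat-pause = 16 s
>>>   # DeadlineFailureDetector has no threshold: 4 s + 16 s = 20 s
>>> }
>>> ```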
>>>
>>>
>>> *Q1*: When using Akka Cluster, if a node goes down (Ctrl+Z, GC, network
>>> failure, etc.), which PhiAccrualFailureDetector is triggered, and when?
>>>
>>
> remote.watch-failure-detector is used when you watch an actor that is
> running on another node that is not part of the same cluster as the
> watcher. E.g. if you use plain akka-remote without akka-cluster or if you
> watch from one cluster to another cluster. I recommend against using watch
> across different clusters or when using plain akka-remote since it creates
> a very strong coupling between the systems.
>
> Watch between nodes of the same cluster is fine, and
> then cluster.failure-detector is used (well, it's used between cluster
> nodes also without any explicit watching).
>
>
>>
>>> *Q2*: I've enabled logs at debug level but cannot see any of these
>>> mentioned; the only one I can see is
>>>
>>> 16:47:22.978 [activity-feeds-akka.actor.default-dispatcher-16] INFO
>>>  a.r.transport.ProtocolStateActor - No response from remote. Transport
>>> failure detector triggered. (internal state was Open)
>>> 16:47:23.109 [activity-feeds-akka.actor.default-dispatcher-6] INFO
>>>  a.r.transport.ProtocolStateActor - No response from remote. Transport
>>> failure detector triggered. (internal state was Open)
>>>
>>> which seems to be akka.remote.transport-failure-detector.
>>>
>>
> The remote.transport-failure-detector is for detecting broken TCP
> connections and is of little importance. Such connections are
> re-established and should not influence application code apart from that
> some messages may be lost.
>
>
>>
>>> Can anyone please help me tune the configuration correctly?
>>>
>>
> The only setting I would change is 
> akka.cluster.failure-detector.acceptable-heartbeat-pause,
> e.g. increase it to 10s if you see too many false "Marking node(s) as
> UNREACHABLE".
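>
> A minimal sketch of that change, assuming the usual application.conf
> override file (10 s is simply the example value mentioned above, not a
> recommendation for every cluster; all other failure-detector settings keep
> their defaults):
>
> ```hocon
> akka.cluster.failure-detector {
>   # Default is 3 s. A larger pause tolerates longer GC or network hiccups,
>   # at the cost of slower detection of nodes that really are down.
>   acceptable-heartbeat-pause = 10 s
> }
> ```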
>
>
>
>>
>>> Cheers,
>>> Francesco
>>>
>>>
>> --
>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>> >>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "Akka User List" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/akka-user.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> --
>
> Patrik Nordwall
> Akka Tech Lead
> Lightbend <http://www.lightbend.com/> -  Reactive apps on the JVM
> Twitter: @patriknw
>