Hi,

6 seconds is quite a bit, but then again it sounds like you may have long
major GCs?

If you have a network that doesn’t play nice, I’d suggest setting the
suspicion threshold to 12. You’ll get a slower detection time, but with 6
seconds of shift you don’t get fast detection anyway. And you’ll have fewer
false positives overall.

If you don’t want to change the suspicion threshold then I’d suggest
reducing the window (max-sample-size if I remember correctly) so that the
FD doesn’t get hung up on “better times” (e.g. reduce to 200 samples, i.e.
a bit over 3 minutes with the default interval of 1s).
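
As a rough sketch, the two options could look like this in application.conf. The key names are as I remember them from Akka’s reference.conf, so please check them against your version; the defaults mentioned in the comments are also from memory, and the values are only the ones suggested in this thread, not tested recommendations.

```hocon
akka.cluster.failure-detector {
  # Option 1: raise the suspicion threshold
  # (the default is 8.0, as far as I recall).
  threshold = 12.0

  # Option 2: keep the threshold and shrink the heartbeat history instead
  # (the default is 1000 samples, as far as I recall).
  # Uncomment this and drop the threshold override above to try it.
  # max-sample-size = 200

  # Assumed to stay at its default: 200 samples at 1 s is a bit over
  # 3 minutes of history.
  heartbeat-interval = 1 s
}
```

I’d change one of the two at a time, so that you can tell which setting made the difference.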

Manuel


On Friday, 2 March 2018, Nikos Viorres <nvior...@gmail.com> wrote:

>
>
> On Friday, March 2, 2018 at 6:05:11 PM UTC+2, Patrik Nordwall wrote:
>>
>> Sounds interesting/strange. For investigation it would be good to have
>> debug logging and verbose-heartbeat-logging enabled.
>>
>> I think it is config akka.cluster.debug.verbose-heartbeat-logging=on,
>> but check reference.conf
>>
>
>
> I'll try to turn this on for one of the largest / noisiest clusters, gather
> the logs, clean them up and report back or share (it may take a while as I
> probably can't force a deployment just for that).
>
> Cheers
> Nikos
>
>
>>
>> /Patrik
>>
>> On Fri, 2 March 2018 at 15:47, Manuel Bernhardt <bernhard...@gmail.com> wrote:
>>
>>> How big is your cluster?
>>>
>>> It looks like the failure detector takes much longer than you'd want
>>> (or simply fails) to stop suspecting other nodes. This could happen
>>> with an accrual FD that gets a lot of slow heartbeats. What's your value of
>>> the suspicion threshold? (akka.cluster.failure-detector.threshold)
>>>
>>> Manuel
>>>
>>> On 2 March 2018 at 14:46, Nikos Viorres <nvio...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> First of all, I'd like to state that we have a "noisy" operational
>>>> environment where network partitions occur more often than we'd like and
>>>> certain components (cluster nodes) experience high GC pause times.
>>>>
>>>> That being said, we are facing the following issue more frequently
>>>> than one would expect: nodes are marked "Unreachable" by part of, or
>>>> the whole, cluster for a period of time (during which there were
>>>> issues), and fail to get back to "Reachable" even after the transient
>>>> issue is resolved. In most cases, most nodes in the cluster that had
>>>> marked such a node as Unreachable are able to re-establish communication
>>>> and move its status back to Reachable, but some node(s) fail to do so
>>>> even though evidence shows that communication with all the rest is
>>>> trouble free and there is no partition at the network layer at that
>>>> point in time. We deduce the last bit from the fact that the node that's
>>>> stuck thinking that the once problematic node is still Unreachable
>>>> receives gossip information from it but discards it for obvious reasons
>>>> (with the message 'Ignoring received gossip from unreachable...'). I
>>>> should add at this point that no Quarantine takes place in any of these
>>>> cases and auto-shutdown is disabled.
>>>>
>>>> Does anyone have any ideas why this might be happening? By looking at
>>>> the logs / code, it is as if the offending node by some combination of
>>>> events stops sending Heartbeats permanently to the nodes that exhibit the
>>>> issue.
>>>>
>>>> --
>>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>>> >>>>>>>>>> Check the FAQ: http://doc.akka.io/docs/akka/current/additional/faq.html
>>>> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "Akka User List" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to akka-user+...@googlegroups.com.
>>>> To post to this group, send email to akka...@googlegroups.com.
>>>> Visit this group at https://groups.google.com/group/akka-user.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
