Hi all,

First of all, I'd like to state that we have a "noisy" operational 
environment where network partitions occur more often than we'd like and 
certain components (cluster nodes) experience high GC pause times.

That being said, we are running into the following issue more often than 
one would expect: a node is marked "Unreachable" by part of, or the whole, 
cluster for a period of time (during which there really were issues), and 
then fails to return to "Reachable" even after the transient issue is 
resolved. In most cases, most of the nodes that had marked such a node as 
Unreachable are able to re-establish communication and move its status 
back to Reachable, but some node(s) fail to do so, even though the 
evidence shows that communication with all the rest is trouble-free and 
there is no partition at the network layer at that point in time. We 
deduce the latter from the fact that the node that is stuck believing the 
once problematic node is still Unreachable does receive gossip from it, 
but discards it because it still considers the sender Unreachable (with 
the message 'Ignoring received gossip from unreachable...'). I should add 
that no Quarantine takes place in any of these cases and auto-shutdown is 
disabled.

Does anyone have any idea why this might be happening? Judging by the 
logs and the code, it looks as if the offending node, through some 
combination of events, permanently stops sending Heartbeats to the nodes 
that exhibit the issue. 

