I checked the logs, and this was basically caused by the sleep/wake cycle. On 
the CC, we keep a timestamp of the last heartbeat received from each NC and 
compare it against the current system time to decide whether the NC has missed 
enough heartbeats to be considered dead. We perform this check every 10 seconds 
on the CC. Whenever you wake up your Mac after a sleep period longer than the 
max heartbeat miss, and the CC monitoring task runs before the NC processes 
resume sending heartbeats, this can happen, since the timestamp in the CC's 
memory is still the time of the last heartbeat received before you put your 
Mac to sleep.
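To make the failure mode concrete, here is a minimal sketch of that CC-side liveness check. The class and method names, and the injectable clock, are my own illustration, not the actual AsterixDB/Hyracks code; the 10-second period and miss count are stand-ins for the real configuration:

```python
import time

class HeartbeatMonitor:
    """Sketch of the CC-side check: a node is declared dead once the time
    since its last recorded heartbeat exceeds max_missed heartbeat periods.
    Names and defaults are assumptions for illustration only."""

    def __init__(self, heartbeat_period_s=10.0, max_missed=5,
                 clock=time.monotonic):
        self.heartbeat_period_s = heartbeat_period_s
        self.max_missed = max_missed
        self.clock = clock  # injectable so a sleep gap can be simulated
        self.last_heartbeat = {}  # node id -> timestamp of last heartbeat

    def on_heartbeat(self, node_id):
        # Record the arrival time of the latest heartbeat from this NC.
        self.last_heartbeat[node_id] = self.clock()

    def dead_nodes(self):
        # Compare "now" against each stored timestamp; the timestamp is
        # never advanced while the machine sleeps, so a long sleep makes
        # a perfectly healthy NC look dead on the first check after wake.
        now = self.clock()
        limit = self.max_missed * self.heartbeat_period_s
        return [n for n, ts in self.last_heartbeat.items() if now - ts > limit]

# Simulating the sleep/wake scenario with a fake clock:
fake_now = [0.0]
monitor = HeartbeatMonitor(clock=lambda: fake_now[0])
monitor.on_heartbeat("nc1")
fake_now[0] = 30.0          # within 5 * 10s: still considered alive
print(monitor.dead_nodes())  # []
fake_now[0] = 120.0         # e.g. the Mac slept for two minutes
print(monitor.dead_nodes())  # ['nc1'] - false positive after wake
```

The key point is that the check only ever looks at a stored timestamp versus wall-clock "now", so there is nothing distinguishing a genuinely dead NC from one whose host was simply asleep.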
We have implemented a mechanism on the current master to reduce the likelihood 
of such false-positive heartbeat misses. The CC now attempts to contact the NC 
and ask it to shut down; if the NC is actually still alive, the NC Service 
process is supposed to restart the NC process, which will cause it to rejoin 
the cluster, and the cluster will become active again. However, our NC Service 
currently doesn't restart the NC process; I think we should change that.
Another option to reduce, but not eliminate, the possibility of this issue is 
to increase the heartbeat miss threshold to something very large (e.g. 24 
hours). That might be suitable for a playground environment like Macs and PCs, 
but it is not an ideal OOTB configuration for a cluster deployment.
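As a rough sketch of what that workaround could look like, assuming the Hyracks CC options for the heartbeat period and miss count are exposed in the [cc] section of cc.conf (the exact key names may differ between versions, so treat these as placeholders):

```ini
[cc]
; Heartbeat period in milliseconds (assumed key name and default).
heartbeat.period=10000
; 8640 misses x 10s = 24 hours before an NC is declared dead.
; Fine for a laptop playground; far too lax for a real cluster.
heartbeat.max.misses=8640
```

With a threshold that large, a laptop would have to sleep for a full day before the CC falsely declared an NC dead, at the cost of taking just as long to notice a genuine failure.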

On 05/02/2018, 10:21 AM, "Mike Carey" <[email protected]> wrote:

    Let me know what it turns out to be!
    
    
    On 5/1/18 12:31 AM, Murtadha Hubail wrote:
    > This is most likely caused by missed heartbeats from the NC to the CC.
    > Some macOS versions had issues with reestablishing connected sockets
    > after waking up from sleep.
    > But it could also be some unexpected exception that caused the NC to
    > shut down. If you could share the logs with me, I can tell you for sure.
    >
    > Cheers,
    > Murtadha
    >
    > On 05/01/2018, 9:06 AM, "Michael Carey" <[email protected]> wrote:
    >
    >      Q:  Do we maybe have a stability regression in recent versions (e.g.,
    >      the one leading to the UW snapshot)?  They have occasionally seen
    >      things like this and I just did too.  (The system had been running
    >      for awhile in the background on my Mac - e.g., for a day or so.)
    >      
    >      Error: Cluster is in UNUSABLE state.
    >        One or more Node Controllers have left or haven't joined yet.
    >      
    >      
    >
    >
    
    

