I checked the logs, and this was caused by the sleep/wake-up cycle. On the CC, we keep a timestamp of the last heartbeat received from each NC and compare it against the current system time to check whether the NC has missed enough heartbeats to be considered dead. We perform this check every 10 seconds on the CC. Whenever you wake your Mac after a sleep period longer than the max heartbeat miss, and the CC monitoring task runs before the NC processes resume sending heartbeats, this can happen, because the timestamp in the CC's memory is the time of the last heartbeat received before you put your Mac to sleep.

We have implemented a mechanism on the current master to reduce the likelihood of such false-positive heartbeat misses. The CC now attempts to contact the NC and ask it to shut down. If the NC is actually still alive, the NC Service process is supposed to restart the NC process, which causes it to rejoin the cluster, and the cluster becomes active again. However, our NC Service currently doesn't restart the NC process; I think we should change that.

Another option to reduce, but not eliminate, the possibility of this issue is to increase the max heartbeat miss to something very large (e.g., 24 hours). That might be suitable for a playground environment like Macs and PCs, but it is not an ideal OOTB configuration for cluster deployments.
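To make the failure mode concrete, here is a minimal Python sketch of the check described above. It is not the actual CC code: the `HeartbeatMonitor` class, its method names, and the 50-second limit are all assumptions made for illustration, and time is passed in explicitly so the sleep period can be simulated.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical stand-in for the CC-side heartbeat bookkeeping."""

    # Assumed limit: how long an NC may stay silent before it is declared dead.
    max_miss_s: float
    # node id -> wall-clock time (seconds) of the last heartbeat received
    last_heartbeat: dict = field(default_factory=dict)

    def record_heartbeat(self, node_id: str, now: float) -> None:
        """Called whenever a heartbeat arrives from an NC."""
        self.last_heartbeat[node_id] = now

    def dead_nodes(self, now: float) -> list:
        """The periodic check: NCs whose last heartbeat is older than the limit."""
        return sorted(
            node_id
            for node_id, seen in self.last_heartbeat.items()
            if now - seen > self.max_miss_s
        )


monitor = HeartbeatMonitor(max_miss_s=50)  # 50 s is an assumed value

# Normal operation: the NC heartbeats, and the next check finds it alive.
monitor.record_heartbeat("nc1", now=0)
print(monitor.dead_nodes(now=10))

# The machine sleeps for 8 hours. On wake-up the wall clock has jumped, but
# the stored timestamp is still the pre-sleep one. If the check runs before
# the NC sends its next heartbeat, the NC is wrongly reported as dead.
wake_time = 8 * 60 * 60
print(monitor.dead_nodes(now=wake_time))

# If the heartbeat had arrived first, the same check would have passed.
monitor.record_heartbeat("nc1", now=wake_time)
print(monitor.dead_nodes(now=wake_time + 1))
```

The sketch only shows the race between the monitoring task and the first post-wake heartbeat. Raising `max_miss_s` to 24 hours would hide it for any shorter sleep, which is why that option reduces the problem without eliminating it.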

On 05/02/2018, 10:21 AM, "Mike Carey" <[email protected]> wrote:

    Let me know what it turns out to be!

    On 5/1/18 12:31 AM, Murtadha Hubail wrote:
    > This is most likely caused by missing heartbeat from the NC to the CC. Some macOS versions had issues with reestablishing connected sockets after waking up from sleep.
    > But it could also be some unexpected exception that caused the NC to shut down. If you could share the logs with me, I can tell you for sure.
    >
    > Cheers,
    > Murtadha
    >
    > On 05/01/2018, 9:06 AM, "Michael Carey" <[email protected]> wrote:
    >
    >     Q: Do we maybe have a stability regression in recent versions (e.g.,
    >     the one leading to the UW snapshot)? They have occasionally seen things
    >     like this and I just did too. (The system had been running for awhile
    >     in the background on my Mac - e.g., for a day or so.)
    >
    >     Error: Cluster is in UNUSABLE state.
    >     One or more Node Controllers have left or haven't joined yet.
    >
