I ran into the same issue while running replication experiments. A quick fix is to raise HEARTBEAT_MAX_MISSES above its default value. On a loaded cluster, some nodes occasionally become unresponsive for a few seconds, and the CC marks them as dead because the defaults are too low.
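
To illustrate why raising the miss limit helps, here is a minimal sketch of a miss-count failure detector. It is not the actual CC code. The class name `HeartbeatPolicy`, the period of 10 seconds, and the miss limits of 5 and 10 are all assumptions picked for the example, so check your own configuration for the real values.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatPolicy:
    """Hypothetical model of a miss-count failure detector (not AsterixDB code)."""

    period_ms: int   # assumed interval between expected heartbeats
    max_misses: int  # assumed number of missed heartbeats that marks a node dead

    def tolerated_silence_ms(self) -> int:
        # Silence at or beyond this length gets the node declared dead.
        return self.period_ms * self.max_misses

    def is_dead(self, silence_ms: int) -> bool:
        # Count the heartbeat periods that fully elapsed without a heartbeat.
        missed = silence_ms // self.period_ms
        return missed >= self.max_misses


# Illustrative values only: a node that stalls for 60 seconds under load.
stall_ms = 60_000
strict = HeartbeatPolicy(period_ms=10_000, max_misses=5)
relaxed = HeartbeatPolicy(period_ms=10_000, max_misses=10)

print(strict.is_dead(stall_ms))   # node is dropped with the lower limit
print(relaxed.is_dead(stall_ms))  # node survives the stall with the higher limit
```

The trade-off is that a higher limit also delays detection of nodes that really are gone, so it is worth raising it only as far as the longest stall you expect to see.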

On Tue, May 1, 2018 at 1:23 AM, Murtadha Hubail <[email protected]> wrote:

> Indeed :-)
>
> On 05/01/2018, 11:03 AM, "Mike Carey" <[email protected]> wrote:
>
> (And several sleep cycles and network changes were involved in my case
> between runs. Typical enterprise use case, right? :-))
>
> On 5/1/18 12:31 AM, Murtadha Hubail wrote:
> > This is most likely caused by missing heartbeat from the NC to the CC.
> > Some macOS versions had issues with reestablishing connected sockets
> > after waking up from sleep.
> > But it could also be some unexpected exception that caused the NC to
> > shut down. If you could share the logs with me, I can tell you for sure.
> >
> > Cheers,
> > Murtadha
> >
> > On 05/01/2018, 9:06 AM, "Michael Carey" <[email protected]> wrote:
> >
> > Q: Do we maybe have a stability regression in recent versions (e.g.,
> > the one leading to the UW snapshot)? They have occasionally seen things
> > like this and I just did too. (The system had been running for awhile
> > in the background on my Mac - e.g., for a day or so.)
> >
> > Error: Cluster is in UNUSABLE state.
> > One or more Node Controllers have left or haven't joined yet.
