On Fri, Jul 15, 2016 at 12:02 AM, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

> This is a cross-post from a user list.
>
> We faced this issue for a lot of times before and got a lot of users
> complaining about the whole cluster freeze. We can protect a cluster from
> such a situation simply by dropping non-responsive nodes from the cluster.
> Of course, we need to get to the bottom of the root cause, and killing
> nodes may cause some data loss in the cluster, but I think it is better
> than restarting the whole cluster from scratch.
>
> To summarize, I suggest to 'kill' non-responsive nodes from topology after
> some timeout in exchange future.
>

Alexey, I like the idea in general, but killing non-responsive nodes seems a bit drastic to me. How about this approach:

- print out IDs/IPs of non-responsive nodes at all times
- introduce a certain kill timeout for non-responsive nodes (-1 means disabled)
- the timeout should be at least a minute after the 1st non-responsive node message is printed
- when the timeout expires, we should kill the nodes and automatically collect their thread dumps
- we should print out a message asking users to provide these thread dumps to us via Jira or dev list

What do you think?

>
> Thoughts?
>
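
To make the kill-timeout proposal above a bit more concrete, here is a rough, language-neutral sketch in Python. It is not Ignite code: the class `NonResponsiveNodeMonitor`, its `check` method, and the constants are hypothetical names made up for illustration. It also assumes one particular reading of the "at least a minute" rule, namely that an enabled timeout is measured from the first warning for a node and may not be configured below 60 seconds. Thread dump collection and the actual kill are left as placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List

DISABLED = -1          # proposed sentinel: -1 means nodes are never killed
MIN_KILL_DELAY = 60.0  # seconds; assumed reading of "at least a minute"


@dataclass
class NonResponsiveNodeMonitor:
    """Hypothetical sketch of the proposed policy; not an Ignite API."""

    kill_timeout: float = DISABLED  # seconds since first warning, or DISABLED
    first_warned: Dict[str, float] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject enabled timeouts shorter than the assumed one-minute floor.
        if self.kill_timeout != DISABLED and self.kill_timeout < MIN_KILL_DELAY:
            raise ValueError("kill_timeout must be -1 or at least 60 seconds")

    def check(self, now: float, non_responsive: Dict[str, str]) -> List[str]:
        """Take the current non-responsive nodes (ID -> IP) and return
        the IDs of the nodes whose kill timeout has expired."""
        # Forget nodes that have started responding again.
        for node_id in list(self.first_warned):
            if node_id not in non_responsive:
                del self.first_warned[node_id]

        to_kill: List[str] = []
        for node_id, ip in non_responsive.items():
            # Always print IDs/IPs, whether or not killing is enabled.
            self.log.append(f"Non-responsive node: id={node_id}, ip={ip}")
            first = self.first_warned.setdefault(node_id, now)
            if self.kill_timeout != DISABLED and now - first >= self.kill_timeout:
                to_kill.append(node_id)

        for node_id in to_kill:
            # Placeholder: a real implementation would collect the thread
            # dump and drop the node from the topology here.
            self.log.append(
                f"Kill timeout expired for node {node_id}. Please send its "
                "thread dump to us via Jira or the dev list."
            )
            del self.first_warned[node_id]
        return to_kill
```

With `kill_timeout=60.0`, a node first reported at time 0 is only logged at 59 seconds and is returned for killing at 60 seconds. With the default of -1, it is logged on every check and never killed.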