On Fri, Jul 15, 2016 at 12:02 AM, Alexey Goncharuk <
alexey.goncha...@gmail.com> wrote:

> This is a cross-post from a user list.
>
> We have run into this issue many times before, and many users have
> complained about the whole cluster freezing. We can protect a cluster from
> such a situation simply by dropping non-responsive nodes from the cluster.
> Of course, we still need to get to the root cause, and killing
> nodes may cause some data loss in the cluster, but I think that is better
> than restarting the whole cluster from scratch.
>
> To summarize, I suggest removing ('killing') non-responsive nodes from the
> topology after some timeout in the exchange future.
>

Alexey, I like the idea in general, but killing non-responsive nodes seems
a bit drastic to me. How about this approach:

- print out IDs/IPs of non-responsive nodes at all times
- introduce a configurable kill timeout for non-responsive nodes (-1 means
disabled)
- the timeout should be at least a minute after the first non-responsive
node message is printed
- when the timeout expires, we should kill the nodes and automatically
collect their thread dumps
- we should print out a message asking users to provide these thread dumps
to us via Jira or dev list
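
To make the timeout rules above concrete, here is a rough sketch of the bookkeeping I have in mind. Everything in it is hypothetical: `NonResponsiveNodeTracker`, its method names and the millisecond units are made up for illustration and are not existing Ignite API. It also assumes the timeout is counted per node from the first time that node is reported, and it leaves the actual kill and thread dump collection to the caller.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Ignite.
class NonResponsiveNodeTracker {
    static final long DISABLED = -1;              // -1 means "never kill"
    static final long MIN_KILL_TIMEOUT_MS = 60_000; // at least a minute

    private final long killTimeoutMs;
    // Node ID -> time when the node was first reported as non-responsive.
    private final Map<String, Long> firstSeenMs = new HashMap<>();

    NonResponsiveNodeTracker(long killTimeoutMs) {
        if (killTimeoutMs != DISABLED && killTimeoutMs < MIN_KILL_TIMEOUT_MS)
            throw new IllegalArgumentException(
                "Kill timeout must be -1 or at least " + MIN_KILL_TIMEOUT_MS + " ms");
        this.killTimeoutMs = killTimeoutMs;
    }

    // Called periodically with the nodes that currently do not respond.
    // Always prints their IDs; returns the nodes whose timeout has expired.
    List<String> check(Collection<String> nonResponsive, long nowMs) {
        // Forget nodes that have started responding again.
        firstSeenMs.keySet().retainAll(nonResponsive);

        List<String> toKill = new ArrayList<>();
        for (String nodeId : nonResponsive) {
            long firstSeen = firstSeenMs.computeIfAbsent(nodeId, id -> nowMs);
            System.out.println("Non-responsive node: " + nodeId);

            if (killTimeoutMs != DISABLED && nowMs - firstSeen >= killTimeoutMs)
                toKill.add(nodeId);
        }
        return toKill;
    }
}
```

For each node returned by `check`, the caller would collect a thread dump, drop the node from the topology, and print the message asking users to send the dumps to us.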

What do you think?


>
> Thoughts?
>
