>
> Alexey, I like the idea in general, but killing non-responsive nodes seems
> a bit drastic to me. How about this approach:
>
> - print out IDs/IPs of non-responsive nodes at all times
> - introduce a certain kill timeout for non-responsive nodes (-1 means
> disabled)
> - the timeout should be at least a minute after the 1st non-responsive node
> message is printed
> - when the timeout expires, we should kill the nodes and automatically
> collect their thread dumps
> - we should print out a message asking users to provide these thread dumps
> to us via Jira or dev list
>
> What do you think?
>

Sounds like a plan. I will create a ticket soon if there are no objections.
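
For reference, the timeout rule in the quoted proposal could be sketched roughly as below. This is only an illustration under assumptions: the class, enum and method names are hypothetical and not part of any existing API, the one-minute minimum is read as "measured from the first non-responsive warning", and killing the node and collecting its thread dump are left to the caller.

```java
// Hypothetical sketch: these names are illustrative and do not exist in any real API.
final class KillTimeoutPolicy {
    /** What to do with a node that is currently non-responsive. */
    enum Action { WARN, KILL_AND_DUMP }

    /** Per the proposal, -1 disables killing; the node is only reported. */
    static final long DISABLED = -1L;

    /** Assumed minimum: one minute after the first warning is printed. */
    static final long MIN_TIMEOUT_MS = 60_000L;

    private final long killTimeoutMs;

    KillTimeoutPolicy(long killTimeoutMs) {
        if (killTimeoutMs != DISABLED && killTimeoutMs < MIN_TIMEOUT_MS)
            throw new IllegalArgumentException(
                "kill timeout must be -1 (disabled) or at least " + MIN_TIMEOUT_MS + " ms");
        this.killTimeoutMs = killTimeoutMs;
    }

    /**
     * @param firstWarningMs time the first non-responsive message was printed for the node
     * @param nowMs          current time, on the same clock
     * @return WARN while the timeout is disabled or has not expired, KILL_AND_DUMP afterwards
     */
    Action decide(long firstWarningMs, long nowMs) {
        if (killTimeoutMs == DISABLED)
            return Action.WARN;
        return nowMs - firstWarningMs >= killTimeoutMs ? Action.KILL_AND_DUMP : Action.WARN;
    }
}
```

The node ID/IP would be printed in both cases; only on KILL_AND_DUMP would the caller stop the node, collect the thread dump, and print the message asking users to send the dump via Jira or the dev list.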

--AG
