> Alexey, I like the idea in general, but killing non-responsive nodes seems
> a bit drastic to me. How about this approach:
>
> - print out IDs/IPs of non-responsive nodes at all times
> - introduce a certain kill timeout for non-responsive nodes (-1 means
>   disabled)
> - the timeout should be at least a minute after the 1st non-responsive node
>   message is printed
> - when the timeout expires, we should kill the nodes and automatically
>   collect their thread dumps
> - we should print out a message asking users to provide these thread dumps
>   to us via Jira or dev list
>
> What do you think?

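To check that I read the proposal correctly, here is a rough sketch of the timeout rule only. Everything in it is hypothetical: the class name `NonResponsiveNodePolicy`, the `decide` method and the `Action` values are made up for illustration and are not existing Ignite APIs. It assumes the timeout is counted from the moment the first non-responsive message for a node is printed, and that a configured value below one minute is rejected rather than silently raised.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the proposed kill-timeout rule; not an Ignite API. */
final class NonResponsiveNodePolicy {
    /** Value that disables killing, as suggested in the proposal. */
    static final long DISABLED = -1;

    /** Proposed lower bound: at least a minute after the first warning. */
    static final long MIN_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(1);

    /** What to do with a node that is currently non-responsive. */
    enum Action { LOG_ONLY, KILL_AND_DUMP }

    private final long killTimeoutMs;

    NonResponsiveNodePolicy(long killTimeoutMs) {
        // Assumption: too-small timeouts are a configuration error.
        if (killTimeoutMs != DISABLED && killTimeoutMs < MIN_TIMEOUT_MS)
            throw new IllegalArgumentException(
                "Kill timeout must be -1 (disabled) or at least " + MIN_TIMEOUT_MS + " ms");

        this.killTimeoutMs = killTimeoutMs;
    }

    /**
     * @param firstWarningMs Time the first non-responsive message was printed for the node.
     * @param nowMs Current time, on the same clock as firstWarningMs.
     * @return LOG_ONLY while we only keep printing the node ID/IP,
     *         KILL_AND_DUMP once the timeout has expired.
     */
    Action decide(long firstWarningMs, long nowMs) {
        if (killTimeoutMs == DISABLED)
            return Action.LOG_ONLY;

        return nowMs - firstWarningMs >= killTimeoutMs ? Action.KILL_AND_DUMP : Action.LOG_ONLY;
    }
}
```

With a 60000 ms timeout, `decide(0, 59_999)` would still only log, and `decide(0, 60_000)` would ask for the kill plus thread dump. How the dump is actually collected from a remote node, and the wording of the message asking users to send it to Jira or the dev list, are left out on purpose.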
Sounds like a plan. I will create a ticket soon if there are no objections.

--AG
