Re: Failed to wait for initial partition map exchange
The ticket is created: https://issues.apache.org/jira/browse/IGNITE-3616

2016-07-15 1:51 GMT+03:00 Alexey Goncharuk:

>> Alexey, I like the idea in general, but killing non-responsive nodes seems
>> a bit drastic to me. How about this approach:
>>
>> - print out IDs/IPs of non-responsive nodes at all times
>> - introduce a certain kill timeout for non-responsive nodes (-1 means
>>   disabled)
>> - the timeout should be at least a minute after the 1st non-responsive
>>   node message is printed
>> - when the timeout expires, we should kill the nodes and automatically
>>   collect their thread dumps
>> - we should print out a message asking users to provide these thread dumps
>>   to us via Jira or dev list
>>
>> What do you think?
>
> Sounds like a plan. I will create a ticket soon if there are no objections.
>
> --AG
Re: Failed to wait for initial partition map exchange
> Alexey, I like the idea in general, but killing non-responsive nodes seems
> a bit drastic to me. How about this approach:
>
> - print out IDs/IPs of non-responsive nodes at all times
> - introduce a certain kill timeout for non-responsive nodes (-1 means
>   disabled)
> - the timeout should be at least a minute after the 1st non-responsive node
>   message is printed
> - when the timeout expires, we should kill the nodes and automatically
>   collect their thread dumps
> - we should print out a message asking users to provide these thread dumps
>   to us via Jira or dev list
>
> What do you think?

Sounds like a plan. I will create a ticket soon if there are no objections.

--AG
Re: Failed to wait for initial partition map exchange
On Fri, Jul 15, 2016 at 12:02 AM, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:

> This is a cross-post from a user list.
>
> We have faced this issue many times before, and many users have complained
> about the whole cluster freezing. We can protect a cluster from such a
> situation simply by dropping non-responsive nodes from the cluster. Of
> course, we need to get to the bottom of the root cause, and killing nodes
> may cause some data loss in the cluster, but I think it is better than
> restarting the whole cluster from scratch.
>
> To summarize, I suggest dropping ('killing') non-responsive nodes from the
> topology after some timeout in the exchange future.

Alexey, I like the idea in general, but killing non-responsive nodes seems a bit drastic to me. How about this approach:

- print out IDs/IPs of non-responsive nodes at all times
- introduce a certain kill timeout for non-responsive nodes (-1 means disabled)
- the timeout should be at least a minute after the 1st non-responsive node message is printed
- when the timeout expires, we should kill the nodes and automatically collect their thread dumps
- we should print out a message asking users to provide these thread dumps to us via Jira or dev list

What do you think?

> Thoughts?
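
For illustration only, the timeout rules in the list above could be sketched roughly as follows. This is not Ignite code: `NonResponsiveNodePolicy`, its `Action` enum, and `localThreadDump` are hypothetical names made up for this sketch, and it assumes the kill timeout is counted in milliseconds from the first "non-responsive node" message. The thread dump helper uses the standard JDK `ThreadMXBean` and only covers the JVM it runs in, so each node would have to collect its own dump before it is stopped.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.time.Duration;

/** Hypothetical policy object for this sketch; not part of the Ignite API. */
final class NonResponsiveNodePolicy {
    /** Timeout value that disables killing, as proposed in the list above. */
    static final long DISABLED = -1L;

    /** Assumed lower bound: at least one minute after the first warning. */
    static final long MIN_TIMEOUT_MS = Duration.ofMinutes(1).toMillis();

    /** What to do with a node that is still non-responsive. */
    enum Action { WARN, KILL_AND_DUMP }

    private final long killTimeoutMs;

    NonResponsiveNodePolicy(long killTimeoutMs) {
        // Reject timeouts shorter than the one-minute minimum, unless disabled.
        if (killTimeoutMs != DISABLED && killTimeoutMs < MIN_TIMEOUT_MS)
            throw new IllegalArgumentException(
                "Kill timeout must be -1 (disabled) or at least " + MIN_TIMEOUT_MS + " ms");
        this.killTimeoutMs = killTimeoutMs;
    }

    /**
     * Decides the action for a node whose first "non-responsive" warning
     * was printed at firstWarningMs, evaluated at time nowMs.
     */
    Action decide(long firstWarningMs, long nowMs) {
        if (killTimeoutMs == DISABLED)
            return Action.WARN; // Killing is disabled: keep printing node IDs/IPs.
        return nowMs - firstWarningMs >= killTimeoutMs ? Action.KILL_AND_DUMP : Action.WARN;
    }

    /**
     * Collects a thread dump of the local JVM only.
     * Note that ThreadInfo.toString() truncates deep stack traces.
     */
    static String localThreadDump() {
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true))
            sb.append(info);
        return sb.toString();
    }
}
```

With a policy created as `new NonResponsiveNodePolicy(60_000L)`, `decide` keeps returning `WARN` until a full minute has passed since the first warning and returns `KILL_AND_DUMP` after that; passing `-1` never escalates beyond `WARN`.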
Re: Failed to wait for initial partition map exchange
This is a cross-post from a user list.

We have faced this issue many times before, and many users have complained about the whole cluster freezing. We can protect a cluster from such a situation simply by dropping non-responsive nodes from the cluster. Of course, we need to get to the bottom of the root cause, and killing nodes may cause some data loss in the cluster, but I think it is better than restarting the whole cluster from scratch.

To summarize, I suggest dropping ('killing') non-responsive nodes from the topology after some timeout in the exchange future.

Thoughts?