Re: Failed to wait for initial partition map exchange

2016-08-01 Thread Alexey Goncharuk
The ticket is created:
https://issues.apache.org/jira/browse/IGNITE-3616

2016-07-15 1:51 GMT+03:00 Alexey Goncharuk :

>> Alexey, I like the idea in general, but killing non-responsive nodes seems
>> a bit drastic to me. How about this approach:
>>
>> - print out IDs/IPs of non-responsive nodes at all times
>> - introduce a certain kill timeout for non-responsive nodes (-1 means
>> disabled)
>> - the timeout should be at least a minute after the 1st non-responsive node
>> message is printed
>> - when the timeout expires, we should kill the nodes and automatically
>> collect their thread dumps
>> - we should print out a message asking users to provide these thread dumps
>> to us via Jira or dev list
>>
>> What do you think?
>>
>
> Sounds like a plan. I will create a ticket soon if there are no objections.
>
> --AG
>


Re: Failed to wait for initial partition map exchange

2016-07-14 Thread Alexey Goncharuk
>
> Alexey, I like the idea in general, but killing non-responsive nodes seems
> a bit drastic to me. How about this approach:
>
> - print out IDs/IPs of non-responsive nodes at all times
> - introduce a certain kill timeout for non-responsive nodes (-1 means
> disabled)
> - the timeout should be at least a minute after the 1st non-responsive node
> message is printed
> - when the timeout expires, we should kill the nodes and automatically
> collect their thread dumps
> - we should print out a message asking users to provide these thread dumps
> to us via Jira or dev list
>
> What do you think?
>

Sounds like a plan. I will create a ticket soon if there are no objections.

--AG


Re: Failed to wait for initial partition map exchange

2016-07-14 Thread Dmitriy Setrakyan
On Fri, Jul 15, 2016 at 12:02 AM, Alexey Goncharuk <
alexey.goncha...@gmail.com> wrote:

> This is a cross-post from a user list.
>
> We have faced this issue many times before, and many users have complained
> about the whole cluster freezing. We can protect a cluster from
> such a situation simply by dropping non-responsive nodes from the cluster.
> Of course, we need to get to the bottom of the root cause, and killing
> nodes may cause some data loss in the cluster, but I think it is better
> than restarting the whole cluster from scratch.
>
> To summarize, I suggest 'killing' non-responsive nodes (removing them from the
> topology) after some timeout in the exchange future.
>

Alexey, I like the idea in general, but killing non-responsive nodes seems
a bit drastic to me. How about this approach:

- print out IDs/IPs of non-responsive nodes at all times
- introduce a certain kill timeout for non-responsive nodes (-1 means
disabled)
- the timeout should be at least a minute after the 1st non-responsive node
message is printed
- when the timeout expires, we should kill the nodes and automatically
collect their thread dumps
- we should print out a message asking users to provide these thread dumps
to us via Jira or dev list

What do you think?
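
The steps in the list above could be sketched roughly as follows. This is only an illustration in plain Java, not Ignite code: the class name NonResponsiveNodeWatchdog, its check method, and the choice to measure the one-minute floor per node from that node's first warning are all assumptions made for the example. Actually stopping a node and collecting its thread dump are left to the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the proposed policy; none of these names exist in Ignite. */
class NonResponsiveNodeWatchdog {
    /** Timeout value that disables killing, as proposed above. */
    static final long DISABLED = -1L;

    /** Minimum delay between the first warning for a node and killing it. */
    static final long MIN_KILL_TIMEOUT_MS = 60_000L;

    private final long killTimeoutMs;

    /** Node ID -> time (ms) at which the first warning for that node was printed. */
    private final Map<String, Long> firstWarned = new LinkedHashMap<>();

    NonResponsiveNodeWatchdog(long killTimeoutMs) {
        // Enforce the one-minute floor unless killing is disabled.
        this.killTimeoutMs = killTimeoutMs == DISABLED
            ? DISABLED
            : Math.max(killTimeoutMs, MIN_KILL_TIMEOUT_MS);
    }

    /**
     * Called periodically with the nodes (ID -> IP) that have not responded yet.
     * Returns the IDs of the nodes whose kill timeout has expired.
     */
    List<String> check(Map<String, String> nonResponsive, long nowMs) {
        // Forget nodes that have responded since the previous check.
        firstWarned.keySet().retainAll(nonResponsive.keySet());

        List<String> toKill = new ArrayList<>();

        for (Map.Entry<String, String> e : nonResponsive.entrySet()) {
            // Always print the IDs/IPs of non-responsive nodes.
            System.out.println("Non-responsive node [id=" + e.getKey() + ", ip=" + e.getValue() + "]");

            long first = firstWarned.computeIfAbsent(e.getKey(), k -> nowMs);

            if (killTimeoutMs != DISABLED && nowMs - first >= killTimeoutMs)
                toKill.add(e.getKey());
        }

        for (String id : toKill) {
            firstWarned.remove(id);

            // Ask the user to share the thread dump that the caller collects.
            System.out.println("Kill timeout expired for node " + id
                + "; please attach its thread dump to a Jira ticket or send it to the dev list.");
        }

        return toKill;
    }
}
```

A caller would invoke check on each timeout tick with the current set of non-responsive nodes and then stop whatever nodes it returns; with a timeout of -1 the sketch only logs.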


>
> Thoughts?
>


Re: Failed to wait for initial partition map exchange

2016-07-14 Thread Alexey Goncharuk
This is a cross-post from a user list.

We have faced this issue many times before, and many users have complained
about the whole cluster freezing. We can protect a cluster from
such a situation simply by dropping non-responsive nodes from the cluster.
Of course, we need to get to the bottom of the root cause, and killing
nodes may cause some data loss in the cluster, but I think it is better
than restarting the whole cluster from scratch.

To summarize, I suggest 'killing' non-responsive nodes (removing them from the
topology) after some timeout in the exchange future.

Thoughts?