Re: Failed to wait for initial partition map exchange

2016-08-01 Thread Alexey Goncharuk
The ticket is created:
https://issues.apache.org/jira/browse/IGNITE-3616

2016-07-15 1:51 GMT+03:00 Alexey Goncharuk:

>> Alexey, I like the idea in general, but killing non-responsive nodes seems
>> a bit drastic to me. How about this approach:
>>
>> - print out IDs/IPs of non-responsive nodes at all times
>> - introduce a certain kill timeout for non-responsive nodes (-1 means
>> disabled)
>> - the timeout should be at least a minute after the 1st non-responsive node
>> message is printed
>> - when the timeout expires, we should kill the nodes and automatically
>> collect their thread dumps
>> - we should print out a message asking users to provide these thread dumps
>> to us via Jira or dev list
>>
>> What do you think?
>>
>
> Sounds like a plan. I will create a ticket soon if there are no objections.
>
> --AG
>


Re: Failed to wait for initial partition map exchange

2016-07-14 Thread Alexey Goncharuk
>
> Alexey, I like the idea in general, but killing non-responsive nodes seems
> a bit drastic to me. How about this approach:
>
> - print out IDs/IPs of non-responsive nodes at all times
> - introduce a certain kill timeout for non-responsive nodes (-1 means
> disabled)
> - the timeout should be at least a minute after the 1st non-responsive node
> message is printed
> - when the timeout expires, we should kill the nodes and automatically
> collect their thread dumps
> - we should print out a message asking users to provide these thread dumps
> to us via Jira or dev list
>
> What do you think?
>

Sounds like a plan. I will create a ticket soon if there are no objections.

--AG


Re: Failed to wait for initial partition map exchange

2016-07-14 Thread Dmitriy Setrakyan
On Fri, Jul 15, 2016 at 12:02 AM, Alexey Goncharuk <
alexey.goncha...@gmail.com> wrote:

> This is a cross-post from a user list.
>
> We have run into this issue many times before, and many users have
> complained about the whole cluster freezing. We can protect a cluster from
> such a situation simply by dropping non-responsive nodes from the cluster.
> Of course, we still need to get to the bottom of the root cause, and killing
> nodes may cause some data loss in the cluster, but I think that is better
> than restarting the whole cluster from scratch.
>
> To summarize, I suggest 'killing' non-responsive nodes, i.e. removing them
> from the topology, after some timeout in the exchange future.
>

Alexey, I like the idea in general, but killing non-responsive nodes seems
a bit drastic to me. How about this approach:

- print out IDs/IPs of non-responsive nodes at all times
- introduce a certain kill timeout for non-responsive nodes (-1 means
disabled)
- the timeout should be at least a minute after the 1st non-responsive node
message is printed
- when the timeout expires, we should kill the nodes and automatically
collect their thread dumps
- we should print out a message asking users to provide these thread dumps
to us via Jira or dev list

What do you think?
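
To make the proposal concrete, here is a minimal Java sketch of the timeout rule only. Everything in it is hypothetical: the class name, the constants and the enum are made up for illustration and are not existing Ignite APIs. It also assumes one particular reading of the third bullet, namely that a configured kill timeout below one minute is rejected.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the proposed policy; none of these names exist in Ignite. */
final class NonResponsiveNodePolicy {
    /** Value meaning "never kill", as proposed above. */
    static final long DISABLED = -1L;

    /** Assumed lower bound: one minute after the first warning is printed. */
    static final long MIN_KILL_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(1);

    /** What to do with a node that is still not responding. */
    enum Action { WARN, KILL_AND_DUMP }

    private final long killTimeoutMs;

    NonResponsiveNodePolicy(long killTimeoutMs) {
        if (killTimeoutMs != DISABLED && killTimeoutMs < MIN_KILL_TIMEOUT_MS)
            throw new IllegalArgumentException(
                "Kill timeout must be -1 or at least " + MIN_KILL_TIMEOUT_MS + " ms");

        this.killTimeoutMs = killTimeoutMs;
    }

    /**
     * @param firstWarnMs Time when the first "non-responsive node" message was printed.
     * @param nowMs Current time, same clock as firstWarnMs.
     * @return WARN while the timeout is disabled or has not expired, KILL_AND_DUMP afterwards.
     */
    Action decide(long firstWarnMs, long nowMs) {
        if (killTimeoutMs == DISABLED)
            return Action.WARN; // Keep printing node IDs/IPs, never kill.

        return nowMs - firstWarnMs >= killTimeoutMs ? Action.KILL_AND_DUMP : Action.WARN;
    }

    public static void main(String[] args) {
        NonResponsiveNodePolicy policy = new NonResponsiveNodePolicy(90_000);

        System.out.println(policy.decide(0, 30_000)); // Timeout not yet expired.
        System.out.println(policy.decide(0, 90_000)); // Timeout expired.
    }
}
```

The actual killing, the thread dump collection and the message asking users to send the dumps are deliberately left out; the sketch only shows when they would be triggered.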


>
> Thoughts?
>


Re: Failed to wait for initial partition map exchange

2016-07-14 Thread Alexey Goncharuk
This is a cross-post from a user list.

We have run into this issue many times before, and many users have
complained about the whole cluster freezing. We can protect a cluster from
such a situation simply by dropping non-responsive nodes from the cluster.
Of course, we still need to get to the bottom of the root cause, and killing
nodes may cause some data loss in the cluster, but I think that is better
than restarting the whole cluster from scratch.

To summarize, I suggest 'killing' non-responsive nodes, i.e. removing them
from the topology, after some timeout in the exchange future.

Thoughts?
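
As a rough illustration of "some timeout in the exchange future", the sketch below waits a bounded time for a reply from every node and reports which nodes are still pending. It uses only java.util.concurrent; the class, the method and the per-node reply map are assumptions for the example and do not reflect how GridDhtPartitionsExchangeFuture is actually implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch; not the real exchange future. */
final class ExchangeWaitSketch {
    /**
     * Waits up to timeoutMs for all replies.
     *
     * @param replies One future per node ID, completed when that node replies.
     * @param timeoutMs Maximum time to wait, in milliseconds.
     * @return IDs of the nodes that have not replied when the wait ends.
     */
    static Set<UUID> pendingAfter(Map<UUID, CompletableFuture<Void>> replies, long timeoutMs) {
        CompletableFuture<Void> all =
            CompletableFuture.allOf(replies.values().toArray(new CompletableFuture[0]));

        try {
            all.get(timeoutMs, TimeUnit.MILLISECONDS);
        }
        catch (TimeoutException e) {
            // Some nodes did not reply in time; they are collected below.
        }
        catch (ExecutionException e) {
            // A reply failed, but every node did answer; nothing is pending.
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // Preserve the interrupt for the caller.
        }

        Set<UUID> pending = new TreeSet<>();

        for (Map.Entry<UUID, CompletableFuture<Void>> e : replies.entrySet()) {
            if (!e.getValue().isDone())
                pending.add(e.getKey());
        }

        return pending;
    }

    public static void main(String[] args) {
        Map<UUID, CompletableFuture<Void>> replies = new HashMap<>();

        replies.put(UUID.randomUUID(), CompletableFuture.completedFuture(null)); // Responsive node.
        replies.put(UUID.randomUUID(), new CompletableFuture<>());               // Never replies.

        System.out.println("Non-responsive nodes: " + pendingAfter(replies, 100));
    }
}
```

The returned set is what would be printed in the log and, after the timeout, dropped from the topology.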


[jira] [Created] (IGNITE-3212) Servers get stuck with the warning "Failed to wait for initial partition map exchange" during failover test

2016-05-30 Thread Ksenia Rybakova (JIRA)
Ksenia Rybakova created IGNITE-3212:
---

         Summary: Servers get stuck with the warning "Failed to wait for initial partition map exchange" during failover test
             Key: IGNITE-3212
             URL: https://issues.apache.org/jira/browse/IGNITE-3212
         Project: Ignite
      Issue Type: Bug
Affects Versions: 1.6
        Reporter: Ksenia Rybakova


Servers that are restarted during the failover test get stuck after some time
with the warning "Failed to wait for initial partition map exchange".
{noformat}
[08:44:41,303][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=db557f04-43b7-4e28-ae0d-d4dcf4139c89, addrs=[10.20.0.222, 127.0.0.1], sockAddrs=[fosters-222/10.20.0.222:47503, /10.20.0.222:47503, /127.0.0.1:47503], discPort=47503, order=44, intOrder=32, lastExchangeTime=1464363880917, loc=false, ver=1.6.0#20160525-sha1:48321a40, isClient=false]
[08:44:41,304][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Topology snapshot [ver=44, servers=19, clients=1, CPUs=64, heap=160.0GB]
[08:45:11,455][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=6fae61a7-c1c1-40e5-8ad0-8bf5d6c86eb7, addrs=[10.20.0.223, 127.0.0.1], sockAddrs=[fosters-223/10.20.0.223:47503, /10.20.0.223:47503, /127.0.0.1:47503], discPort=47503, order=45, intOrder=33, lastExchangeTime=1464363910999, loc=false, ver=1.6.0#20160525-sha1:48321a40, isClient=false]
[08:45:11,455][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Topology snapshot [ver=45, servers=20, clients=1, CPUs=64, heap=170.0GB]
[08:45:19,942][INFO ][ignite-update-notifier-timer][GridUpdateNotifier] Update status is not available.
[08:46:20,370][WARN ][main][GridCachePartitionExchangeManager] Failed to wait for initial partition map exchange. Possible reasons are:
  ^-- Transactions in deadlock.
  ^-- Long running transactions (ignore if this is the case).
  ^-- Unreleased explicit locks.
[08:48:30,375][WARN ][main][GridCachePartitionExchangeManager] Still waiting for initial partition map exchange ...
{noformat}

Other nodes log "Failed to wait for partition release future" warnings:
{noformat}
[08:09:45,822][WARN ][exchange-worker-#82%null%][GridDhtPartitionsExchangeFuture] Failed to wait for partition release future [topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], node=cab5d0e0-7365-4774-8f99-d9f131c5d896]. Dumping pending objects that might be the cause:
[08:09:45,822][WARN ][exchange-worker-#82%null%][GridCachePartitionExchangeManager] Ready affinity version: AffinityTopologyVersion [topVer=28, minorTopVer=1]
[08:09:45,826][WARN ][exchange-worker-#82%null%][GridCachePartitionExchangeManager] Last exchange future: GridDhtPartitionsExchangeFuture ...
{noformat}
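
When a node is stuck with warnings like these, a thread dump taken on that node is usually the most useful attachment. The sketch below shows one way to capture a dump from inside the JVM using the standard java.lang.management API; the class and method names are made up for the example. Note that findDeadlockedThreads() only detects JVM-level monitor and lock deadlocks, so it will not report the transaction deadlocks mentioned in the warning.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper; uses only the standard JMX thread API. */
final class ThreadDumpSketch {
    /** @return A plain-text dump of all live threads in the current JVM. */
    static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();

        // Lock details are optional JVM capabilities, so request them only if supported.
        boolean monitors = mx.isObjectMonitorUsageSupported();
        boolean syncs = mx.isSynchronizerUsageSupported();

        StringBuilder sb = new StringBuilder();

        for (ThreadInfo ti : mx.dumpAllThreads(monitors, syncs)) {
            sb.append('"').append(ti.getThreadName()).append('"')
                .append(" id=").append(ti.getThreadId())
                .append(" state=").append(ti.getThreadState()).append('\n');

            if (ti.getLockName() != null)
                sb.append("    waiting on ").append(ti.getLockName()).append('\n');

            for (StackTraceElement el : ti.getStackTrace())
                sb.append("    at ").append(el).append('\n');

            sb.append('\n');
        }

        if (syncs) {
            long[] deadlocked = mx.findDeadlockedThreads(); // null when no JVM-level deadlock exists.

            sb.append("JVM-level deadlocked threads: ")
                .append(deadlocked == null ? 0 : deadlocked.length).append('\n');
        }

        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

For a node that cannot run extra code, running the JDK's jstack tool against the process ID produces a comparable dump from outside the process.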

Load config:
- 1 client, 20 servers (5 servers per 1 host)
- warmup 60
- duration 66h
- preload 5M
- key range 10M
- operations: PUT PUT_ALL GET GET_ALL INVOKE INVOKE_ALL REMOVE REMOVE_ALL PUT_IF_ABSENT REPLACE
- backups count 3
- 3 servers restart every 15 min with 30 sec step, pause between stop and start 5 min







