Re: Failed to wait for initial partition map exchange
The ticket is created: https://issues.apache.org/jira/browse/IGNITE-3616

2016-07-15 1:51 GMT+03:00 Alexey Goncharuk:

>> Alexey, I like the idea in general, but killing non-responsive nodes
>> seems a bit drastic to me. How about this approach:
>>
>> - print out IDs/IPs of non-responsive nodes at all times
>> - introduce a certain kill timeout for non-responsive nodes (-1 means
>>   disabled)
>> - the timeout should be at least a minute after the 1st non-responsive
>>   node message is printed
>> - when the timeout expires, we should kill the nodes and automatically
>>   collect their thread dumps
>> - we should print out a message asking users to provide these thread
>>   dumps to us via Jira or dev list
>>
>> What do you think?
>
> Sounds like a plan. I will create a ticket soon if there are no
> objections.
>
> --AG
Re: Failed to wait for initial partition map exchange
> Alexey, I like the idea in general, but killing non-responsive nodes
> seems a bit drastic to me. How about this approach:
>
> - print out IDs/IPs of non-responsive nodes at all times
> - introduce a certain kill timeout for non-responsive nodes (-1 means
>   disabled)
> - the timeout should be at least a minute after the 1st non-responsive
>   node message is printed
> - when the timeout expires, we should kill the nodes and automatically
>   collect their thread dumps
> - we should print out a message asking users to provide these thread
>   dumps to us via Jira or dev list
>
> What do you think?

Sounds like a plan. I will create a ticket soon if there are no objections.

--AG
Re: Failed to wait for initial partition map exchange
On Fri, Jul 15, 2016 at 12:02 AM, Alexey Goncharuk <
alexey.goncha...@gmail.com> wrote:

> This is a cross-post from the user list.
>
> We have faced this issue many times before, and a lot of users have
> complained about the whole cluster freezing. We can protect a cluster
> from such a situation simply by dropping non-responsive nodes from the
> cluster. Of course, we still need to get to the bottom of the root
> cause, and killing nodes may cause some data loss in the cluster, but I
> think it is better than restarting the whole cluster from scratch.
>
> To summarize, I suggest 'killing' non-responsive nodes, i.e. removing
> them from the topology, after some timeout in the exchange future.

Alexey, I like the idea in general, but killing non-responsive nodes seems
a bit drastic to me. How about this approach:

- print out IDs/IPs of non-responsive nodes at all times
- introduce a certain kill timeout for non-responsive nodes (-1 means
  disabled)
- the timeout should be at least a minute after the 1st non-responsive
  node message is printed
- when the timeout expires, we should kill the nodes and automatically
  collect their thread dumps
- we should print out a message asking users to provide these thread
  dumps to us via Jira or dev list

What do you think?

> Thoughts?
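A minimal sketch of the sequence proposed above, assuming a hypothetical
EXCHANGE_KILL_TIMEOUT system property and a hypothetical onNodesNonResponsive
hook; none of this is actual Ignite internals. The only real APIs used are
IgniteCompute.broadcast, IgniteCluster.stopNodes(Collection<UUID>) and the
standard ThreadMXBean for thread dumps:

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Collection;
import java.util.UUID;

import org.apache.ignite.Ignite;
import org.apache.ignite.lang.IgniteCallable;

/**
 * Hypothetical sketch of the proposed behavior, not actual Ignite code:
 * once the kill timeout expires, collect thread dumps from the
 * non-responsive nodes, ask the user to share them, then stop the nodes.
 */
public class ExchangeWatchdogSketch {
    /** Kill timeout in ms; -1 means killing is disabled (hypothetical setting). */
    private static final long KILL_TIMEOUT = Long.getLong("EXCHANGE_KILL_TIMEOUT", -1L);

    public static void onNodesNonResponsive(Ignite ignite, Collection<UUID> nodeIds) throws Exception {
        // Always print the IDs of non-responsive nodes, as proposed.
        System.err.println("Non-responsive nodes during partition map exchange: " + nodeIds);

        if (KILL_TIMEOUT < 0)
            return; // Killing disabled.

        // Per the proposal, wait at least a minute after the first warning.
        Thread.sleep(Math.max(KILL_TIMEOUT, 60_000L));

        // Collect thread dumps from the stuck nodes before killing them.
        Collection<String> dumps = ignite.compute(ignite.cluster().forNodeIds(nodeIds))
            .broadcast((IgniteCallable<String>)ExchangeWatchdogSketch::localThreadDump);

        dumps.forEach(System.err::println);
        System.err.println("Please attach the thread dumps above to a Jira issue or the dev list.");

        // Stop the non-responsive nodes (real public API).
        ignite.cluster().stopNodes(nodeIds);
    }

    /** Renders a thread dump of the local JVM using standard JMX beans. */
    static String localThreadDump() {
        StringBuilder sb = new StringBuilder();

        for (ThreadInfo ti : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true))
            sb.append(ti);

        return sb.toString();
    }
}
{code}

Note that a node stuck in exchange may well be unable to execute the
broadcast job, so a real implementation (what became IGNITE-3616) would have
to collect the dumps from inside the exchange manager rather than via compute.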
Re: Failed to wait for initial partition map exchange
This is a cross-post from the user list.

We have faced this issue many times before, and a lot of users have
complained about the whole cluster freezing. We can protect a cluster from
such a situation simply by dropping non-responsive nodes from the cluster.
Of course, we still need to get to the bottom of the root cause, and
killing nodes may cause some data loss in the cluster, but I think it is
better than restarting the whole cluster from scratch.

To summarize, I suggest 'killing' non-responsive nodes, i.e. removing them
from the topology, after some timeout in the exchange future.

Thoughts?
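For context, Ignite already removes nodes whose discovery heartbeats stall,
via the failure detection timeout; the suggestion above extends that idea to
nodes that are alive but stuck in the exchange future. A minimal
configuration sketch (the 10-second value is illustrative):

{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class FailureDetectionConfig {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Existing mechanism: a node that does not answer discovery
        // messages within this timeout is dropped from the topology.
        // It does NOT cover a node that is alive but blocked inside the
        // partition map exchange, which is the gap discussed above.
        cfg.setFailureDetectionTimeout(10_000L); // illustrative value

        Ignite ignite = Ignition.start(cfg);
    }
}
{code}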
[jira] [Created] (IGNITE-3212) Servers get stuck with the warning "Failed to wait for initial partition map exchange" during failover test
Ksenia Rybakova created IGNITE-3212:
---------------------------------------

             Summary: Servers get stuck with the warning "Failed to wait for initial partition map exchange" during failover test
                 Key: IGNITE-3212
                 URL: https://issues.apache.org/jira/browse/IGNITE-3212
             Project: Ignite
          Issue Type: Bug
    Affects Versions: 1.6
            Reporter: Ksenia Rybakova

Servers being restarted during the failover test get stuck after some time
with the warning "Failed to wait for initial partition map exchange".

{noformat}
[08:44:41,303][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=db557f04-43b7-4e28-ae0d-d4dcf4139c89, addrs=[10.20.0.222, 127.0.0.1], sockAddrs=[fosters-222/10.20.0.222:47503, /10.20.0.222:47503, /127.0.0.1:47503], discPort=47503, order=44, intOrder=32, lastExchangeTime=1464363880917, loc=false, ver=1.6.0#20160525-sha1:48321a40, isClient=false]
[08:44:41,304][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Topology snapshot [ver=44, servers=19, clients=1, CPUs=64, heap=160.0GB]
[08:45:11,455][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=6fae61a7-c1c1-40e5-8ad0-8bf5d6c86eb7, addrs=[10.20.0.223, 127.0.0.1], sockAddrs=[fosters-223/10.20.0.223:47503, /10.20.0.223:47503, /127.0.0.1:47503], discPort=47503, order=45, intOrder=33, lastExchangeTime=1464363910999, loc=false, ver=1.6.0#20160525-sha1:48321a40, isClient=false]
[08:45:11,455][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] Topology snapshot [ver=45, servers=20, clients=1, CPUs=64, heap=170.0GB]
[08:45:19,942][INFO ][ignite-update-notifier-timer][GridUpdateNotifier] Update status is not available.
[08:46:20,370][WARN ][main][GridCachePartitionExchangeManager] Failed to wait for initial partition map exchange. Possible reasons are:
  ^-- Transactions in deadlock.
  ^-- Long running transactions (ignore if this is the case).
  ^-- Unreleased explicit locks.
[08:48:30,375][WARN ][main][GridCachePartitionExchangeManager] Still waiting for initial partition map exchange ...
{noformat}

"Failed to wait for partition release future" warnings appear on other nodes:

{noformat}
[08:09:45,822][WARN ][exchange-worker-#82%null%][GridDhtPartitionsExchangeFuture] Failed to wait for partition release future [topVer=AffinityTopologyVersion [topVer=29, minorTopVer=0], node=cab5d0e0-7365-4774-8f99-d9f131c5d896]. Dumping pending objects that might be the cause:
[08:09:45,822][WARN ][exchange-worker-#82%null%][GridCachePartitionExchangeManager] Ready affinity version: AffinityTopologyVersion [topVer=28, minorTopVer=1]
[08:09:45,826][WARN ][exchange-worker-#82%null%][GridCachePartitionExchangeManager] Last exchange future: GridDhtPartitionsExchangeFuture ...
{noformat}

Load config:
- 1 client, 20 servers (5 servers per host)
- warmup 60
- duration 66h
- preload 5M
- key range 10M
- operations: PUT PUT_ALL GET GET_ALL INVOKE INVOKE_ALL REMOVE REMOVE_ALL PUT_IF_ABSENT REPLACE
- backups count 3
- 3 servers restart every 15 min with 30 sec step, pause between stop and start 5 min

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
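When a hang like this reproduces, the most useful artifact is a thread dump
of the node whose main thread prints "Still waiting for initial partition
map exchange". A small, hypothetical watchdog around Ignition.start() that
captures one automatically; the class name, config path and the 2-minute
timeout are all illustrative:

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Timer;
import java.util.TimerTask;

import org.apache.ignite.Ignition;

public class StartupDumpWatchdog {
    public static void main(String[] args) {
        // Ignition.start() blocks until the initial partition map exchange
        // completes, so if startup hangs we dump all threads after 2 minutes.
        Timer watchdog = new Timer("startup-dump-watchdog", true);

        watchdog.schedule(new TimerTask() {
            @Override public void run() {
                for (ThreadInfo ti : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true))
                    System.err.print(ti);
            }
        }, 120_000L);

        Ignition.start("config/ignite.xml"); // illustrative config path

        watchdog.cancel(); // Startup finished; no dump needed.
    }
}
{code}

Equivalently, running jstack against the stuck process gives the same
information; either output is exactly what helps diagnose hangs like
IGNITE-3212.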