[
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578279#comment-16578279
]
Dmitry Sherstobitov commented on IGNITE-7165:
---------------------------------------------
I have a problem with the current solution.
The following test passed on the version before the fix, but hangs on current
master on the first iteration.
The test hangs while waiting on the JMX metric LocalNodeMovingPartitionsCount,
and it looks like rebalance did not start at all.
Repeat 10 times:
1. stop node
2. clean lfs
3. add stopped node (trigger rebalance)
4. 3 times: start 2 clients, wait for topology snapshot, close clients
5. for each cache group check JMX metrics LocalNodeMovingPartitionsCount (like
waitForFinishRebalance())
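
Step 5 can be sketched roughly as below. This is only an illustration and not the actual test code: the class RebalanceWait, its method names, and the timeout and poll parameters are hypothetical, and the ObjectName pattern is left as a parameter because the exact MBean naming depends on the Ignite version and configuration. Only the attribute name LocalNodeMovingPartitionsCount comes from the report above. A bounded wait makes the test fail with a timeout instead of hanging when rebalance never starts.

```java
import java.util.function.LongSupplier;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

/** Hypothetical helper: bounded wait for a "moving partitions" metric to reach zero. */
final class RebalanceWait {
    /**
     * Sums the LocalNodeMovingPartitionsCount attribute over all MBeans matching
     * the given pattern. The pattern is an assumption supplied by the caller.
     */
    static long movingPartitions(MBeanServerConnection srv, ObjectName pattern)
            throws JMException, java.io.IOException {
        long total = 0;
        for (ObjectName name : srv.queryNames(pattern, null)) {
            Object val = srv.getAttribute(name, "LocalNodeMovingPartitionsCount");
            if (val instanceof Number)
                total += ((Number) val).longValue();
        }
        return total;
    }

    /**
     * Polls the metric until it reports zero or the timeout expires.
     * Returns true if zero was observed, false on timeout or interruption.
     */
    static boolean waitUntilZero(LongSupplier metric, long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (metric.getAsLong() == 0)
                return true;
            if (System.currentTimeMillis() >= deadline)
                return false;
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In a test, the supplier passed to waitUntilZero would wrap movingPartitions for each cache group and treat a false return as a failure.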
> Re-balancing is cancelled if client node joins
> ----------------------------------------------
>
> Key: IGNITE-7165
> URL: https://issues.apache.org/jira/browse/IGNITE-7165
> Project: Ignite
> Issue Type: Bug
> Reporter: Mikhail Cherkasov
> Assignee: Maxim Muzafarov
> Priority: Critical
> Labels: rebalance
> Fix For: 2.7
>
>
> Re-balancing is canceled if a client node joins. Re-balancing can take hours,
> and each time a client node joins it starts again:
> [15:10:05,700][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
> Added new node to topology: TcpDiscoveryNode
> [id=979cf868-1c37-424a-9ad1-12db501f32ef, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1,
> 172.31.16.213], sockAddrs=[/0:0:0:0:0:0:0:1:0, /127.0.0.1:0,
> /172.31.16.213:0], discPort=0, order=36, intOrder=24,
> lastExchangeTime=1512907805688, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe,
> isClient=true]
> [15:10:05,701][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
> Topology snapshot [ver=36, servers=7, clients=5, CPUs=128, heap=160.0GB]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Started
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> crd=false, evt=NODE_JOINED, evtNode=979cf868-1c37-424a-9ad1-12db501f32ef,
> customEvt=null, allowMerge=true]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
> Finish exchange future [startVer=AffinityTopologyVersion [topVer=36,
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> err=null]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Finished
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> crd=false]
> [15:10:05,703][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
> Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
> [topVer=36, minorTopVer=0], evt=NODE_JOINED,
> node=979cf868-1c37-424a-9ad1-12db501f32ef]
> [15:10:08,706][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
> Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion
> [topVer=35, minorTopVer=0]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
> Rebalancing scheduled [order=[statementp]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
> Rebalancing started [top=null, evt=NODE_JOINED,
> node=a8be3c14-9add-48c3-b099-3fd304cfdbf4]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
> Starting rebalancing [mode=ASYNC,
> fromNode=2f6bde48-ffb5-4815-bd32-df4e57dc13e0, partitionsCount=18,
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> updateSeq=-1754630006]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
> Starting rebalancing [mode=ASYNC,
> fromNode=35d01141-4dce-47dd-adf6-a4f3b2bb9da9, partitionsCount=15,
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
> Starting rebalancing [mode=ASYNC,
> fromNode=b3a8be53-e61f-4023-a906-a265923837ba, partitionsCount=15,
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
> Starting rebalancing [mode=ASYNC,
> fromNode=f825cb4e-7dcc-405f-a40d-c1dc1a3ade5a, partitionsCount=12,
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
> Starting rebalancing [mode=ASYNC,
> fromNode=4ae1db91-8b88-4180-a84b-127a303959e9, partitionsCount=11,
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
> Starting rebalancing [mode=ASYNC,
> fromNode=7c286481-7638-49e4-8c68-fa6aa65d8b76, partitionsCount=18,
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> updateSeq=-1754630006]
> So in clusters with a large amount of data and frequent client leave/join
> events, this means that a new server will never receive its partitions.