[
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579657#comment-16579657
]
Dmitry Sherstobitov commented on IGNITE-7165:
---------------------------------------------
For now, I have no reproducer in Java.
I've investigated the persistent store in my test and found that the
rebalanced data is present in storage on the node with the cleared LFS, but
the LocalNodeMovingPartitionsCount metric is definitely broken after a client
node joins the cluster. If I remove the client join event after the node is
back, rebalancing finishes correctly.
Here is an excerpt from my test log (rebalancing didn't finish in 240 seconds,
while in previous versions it completed in 10-15 seconds):
[13:14:17][:568 :617] Wait rebalance to finish 8/240
Current metric state for cache cache_group_3_088 on node 2: 19
....
[13:18:04][:568 :617] Wait rebalance to finish 235/240
Current metric state for cache cache_group_3_088 on node 2: 19
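The waiting loop behind the log above can be sketched as a small self-contained helper. This is not the actual test code: the metric supplier, the method name waitRebalanceFinish, and the poll interval are assumptions; in a real test the supplier would read LocalNodeMovingPartitionsCount for the cache group, e.g. via the cache group metrics MBean.

```java
import java.util.function.IntSupplier;

public class RebalanceWait {
    /**
     * Polls a moving-partitions metric until it drops to zero or the timeout
     * expires, mirroring the "Wait rebalance to finish N/240" loop from the
     * test log. The metric source is a plain supplier here (an assumption),
     * so the helper can be exercised without a running cluster.
     *
     * @param movingPartitions supplier of the current moving-partitions count.
     * @param timeoutSec total number of polls (one per poll interval).
     * @param pollMs delay between polls in milliseconds.
     * @return true if the metric reached zero within the timeout.
     */
    public static boolean waitRebalanceFinish(IntSupplier movingPartitions,
        int timeoutSec, long pollMs) throws InterruptedException {
        for (int sec = 0; sec < timeoutSec; sec++) {
            int moving = movingPartitions.getAsInt();

            System.out.println("Wait rebalance to finish " + sec + "/"
                + timeoutSec + " Current metric state: " + moving);

            if (moving == 0)
                return true; // Rebalancing is done from this node's perspective.

            Thread.sleep(pollMs);
        }
        return false; // Metric never reached zero: stuck, as in the log above.
    }
}
```

With the broken metric described above, the supplier keeps returning 19, so this loop runs to the full timeout and the test reports that rebalancing never finished.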
> Re-balancing is cancelled if client node joins
> ----------------------------------------------
>
> Key: IGNITE-7165
> URL: https://issues.apache.org/jira/browse/IGNITE-7165
> Project: Ignite
> Issue Type: Bug
> Reporter: Mikhail Cherkasov
> Assignee: Maxim Muzafarov
> Priority: Critical
> Labels: rebalance
> Fix For: 2.7
>
> Attachments: node-NO_REBALANCE-7165.log
>
>
> Re-balancing is cancelled if a client node joins. Re-balancing can take
> hours, and each time a client node joins it starts over:
> [15:10:05,700][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
> Added new node to topology: TcpDiscoveryNode
> [id=979cf868-1c37-424a-9ad1-12db501f32ef, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1,
> 172.31.16.213], sockAddrs=[/0:0:0:0:0:0:0:1:0, /127.0.0.1:0,
> /172.31.16.213:0], discPort=0, order=36, intOrder=24,
> lastExchangeTime=1512907805688, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe,
> isClient=true]
> [15:10:05,701][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
> Topology snapshot [ver=36, servers=7, clients=5, CPUs=128, heap=160.0GB]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Started
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> crd=false, evt=NODE_JOINED, evtNode=979cf868-1c37-424a-9ad1-12db501f32ef,
> customEvt=null, allowMerge=true]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
> Finish exchange future [startVer=AffinityTopologyVersion [topVer=36,
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> err=null]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Finished
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> crd=false]
> [15:10:05,703][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
> Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
> [topVer=36, minorTopVer=0], evt=NODE_JOINED,
> node=979cf868-1c37-424a-9ad1-12db501f32ef]
> [15:10:08,706][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
> Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion
> [topVer=35, minorTopVer=0]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
> Rebalancing scheduled [order=[statementp]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
> Rebalancing started [top=null, evt=NODE_JOINED,
> node=a8be3c14-9add-48c3-b099-3fd304cfdbf4]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
> Starting rebalancing [mode=ASYNC,
> fromNode=2f6bde48-ffb5-4815-bd32-df4e57dc13e0, partitionsCount=18,
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> updateSeq=-1754630006]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
> Starting rebalancing [mode=ASYNC,
> fromNode=35d01141-4dce-47dd-adf6-a4f3b2bb9da9, partitionsCount=15,
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
> Starting rebalancing [mode=ASYNC,
> fromNode=b3a8be53-e61f-4023-a906-a265923837ba, partitionsCount=15,
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
> Starting rebalancing [mode=ASYNC,
> fromNode=f825cb4e-7dcc-405f-a40d-c1dc1a3ade5a, partitionsCount=12,
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
> Starting rebalancing [mode=ASYNC,
> fromNode=4ae1db91-8b88-4180-a84b-127a303959e9, partitionsCount=11,
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
> Starting rebalancing [mode=ASYNC,
> fromNode=7c286481-7638-49e4-8c68-fa6aa65d8b76, partitionsCount=18,
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0],
> updateSeq=-1754630006]
> So in clusters with a large amount of data and frequent client leave/join
> events, this means that a new server will never receive its partitions.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)