[jira] [Issue Comment Deleted] (IGNITE-7165) Re-balancing is cancelled if client node joins

Maxim Muzafarov (JIRA) Sun, 29 Jul 2018 05:11:36 -0700


     [ https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Maxim Muzafarov updated IGNITE-7165:
------------------------------------
    Comment: was deleted

(was: h5. Changes ready
 * TC: 
 * PR: [#4442|https://github.com/apache/ignite/pull/4442]
 * Upsource: [IGNT-CR-699|https://reviews.ignite.apache.org/ignite/review/IGNT-CR-699]

h5. Implementation details
 # _Keep the topology version being rebalanced (it is no longer always the last topology version)_
 To calculate the affinity assignment difference against the last topology version, we should save the version on which rebalancing is currently running.
 # _REPLICATED cache processing_
 The affinity assignment for this type of cache never changes, so we don't need to stop rebalancing for this cache each time a new topology version arrives. Rebalancing should run only once, except when the nodes from which this group's cache partitions are being demanded {{LEFT}} or {{FAIL}} the cluster.
 # _EMPTY assignments handling_
 Each time the {{generateAssignments}} method determines that there is no difference with the current topology version (returns an empty map), we should return a successful result as fast as possible, no matter how the affinity changed.
 # _RENTING\EVICTING partition after PME_
 PME prepares partitions to be {{RENTED}} or {{EVICTED}} if they are not assigned to the local node according to the new affinity calculation. Processing a stale supply message (from a previous topology version) can lead to exceptions when getting local partitions that are in an incorrect state. That's why stale {{GridDhtPartitionSupplyMessage}} messages must be ignored by the {{Demander}}.
 # _Supply context map clearing changed_
 Previously, the supply context map was cleared after every topology version change. Since rebalancing can now run on a topology version other than the latest one, this behavior should change: clear contexts only for nodes that left or failed from the topology.
 # _{{LEFT}} or {{FAIL}} nodes from cluster (rebalance restart)_
 If the rebalance future demands partitions from nodes that have left the cluster, rebalancing must be restarted.
 # _OWNING → MOVING on coordinator due to obsolete partition update counter_
 The affinity assignment may have no changes while rebalancing is running. The coordinator performs PME and, after merging all SingleMessages, marks partitions with an obsolete update sequence to be demanded from remote nodes (by changing the partition state OWNING -> MOVING). We should schedule a new rebalance in this case.)
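The numbered items in the deleted comment describe when a running rebalance should keep going and when it should restart. The following is a minimal Java sketch of that decision procedure under stated assumptions: RebalanceDecisionSketch, onTopologyChange and acceptSupplyMessage are hypothetical names, not Ignite classes or methods; topology versions are modelled as plain long values; and the assignment difference is passed in by the caller instead of being computed. It covers items 1-4, 6 and 7; supply-context clearing (item 5) is left out.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical model of the rebalance restart rules from the comment above.
 * None of these names are Ignite classes or methods.
 */
public class RebalanceDecisionSketch {
    enum Decision { KEEP_RUNNING, RESTART }

    /** Item 1: version the running rebalance was started on, not necessarily the latest one. */
    private long rebalanceTopVer;

    /** Nodes the running rebalance currently demands partitions from. */
    private final Set<String> suppliers = new HashSet<>();

    RebalanceDecisionSketch(long topVer, Set<String> initialSuppliers) {
        rebalanceTopVer = topVer;
        suppliers.addAll(initialSuppliers);
    }

    /**
     * @param newTopVer      topology version that just arrived
     * @param replicated     whether the cache group is REPLICATED
     * @param assignmentDiff partitions to demand per supplier; empty means no difference
     * @param departed       nodes that LEFT or FAILED with this topology change
     * @param markedMoving   whether the coordinator switched local partitions OWNING -> MOVING
     */
    Decision onTopologyChange(long newTopVer, boolean replicated,
                              Map<String, Set<Integer>> assignmentDiff,
                              Set<String> departed, boolean markedMoving) {
        // Item 6: a supplier left or failed, so the running rebalance cannot complete.
        // Item 7: partitions with obsolete update counters must be demanded again.
        if (!Collections.disjoint(suppliers, departed) || markedMoving)
            return restart(newTopVer, assignmentDiff);

        // Item 2: REPLICATED affinity does not change; item 3: empty assignment difference.
        if (replicated || assignmentDiff.isEmpty())
            return Decision.KEEP_RUNNING;

        return restart(newTopVer, assignmentDiff);
    }

    /** Restarts on the new version and remembers it as the version being rebalanced. */
    private Decision restart(long newTopVer, Map<String, Set<Integer>> assignmentDiff) {
        rebalanceTopVer = newTopVer;
        suppliers.clear();
        suppliers.addAll(assignmentDiff.keySet());
        return Decision.RESTART;
    }

    /** Item 4 (assumption): a supply message prepared for an older version is stale and ignored. */
    boolean acceptSupplyMessage(long msgTopVer) {
        return msgTopVer >= rebalanceTopVer;
    }

    long rebalanceTopVer() {
        return rebalanceTopVer;
    }

    public static void main(String[] args) {
        RebalanceDecisionSketch r = new RebalanceDecisionSketch(35, Set.of("A", "B"));

        // Client join (topVer=36): no difference, rebalancing keeps running on version 35.
        System.out.println(r.onTopologyChange(36, false, Map.of(), Set.of(), false)
            + " on " + r.rebalanceTopVer()); // prints "KEEP_RUNNING on 35"

        // Supplier B fails (topVer=37): rebalancing restarts on 37 from the remaining supplier.
        System.out.println(r.onTopologyChange(37, false, Map.of("A", Set.of(1, 2)), Set.of("B"), false)
            + " on " + r.rebalanceTopVer()); // prints "RESTART on 37"
    }
}
```

The departed-supplier and OWNING -> MOVING checks run before the REPLICATED and empty-difference shortcuts because items 2 and 7 describe exactly those cases as exceptions to "keep running".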

> Re-balancing is cancelled if client node joins
> ----------------------------------------------
>
>                 Key: IGNITE-7165
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7165
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mikhail Cherkasov
>            Assignee: Maxim Muzafarov
>            Priority: Critical
>              Labels: rebalance
>             Fix For: 2.7
>
>
> Re-balancing is cancelled if a client node joins. Re-balancing can take hours,
> and each time a client node joins it starts again:
> [15:10:05,700][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Added new node to topology: TcpDiscoveryNode 
> [id=979cf868-1c37-424a-9ad1-12db501f32ef, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 
> 172.31.16.213], sockAddrs=[/0:0:0:0:0:0:0:1:0, /127.0.0.1:0, 
> /172.31.16.213:0], discPort=0, order=36, intOrder=24, 
> lastExchangeTime=1512907805688, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, 
> isClient=true]
> [15:10:05,701][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Topology snapshot [ver=36, servers=7, clients=5, CPUs=128, heap=160.0GB]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=979cf868-1c37-424a-9ad1-12db501f32ef, 
> customEvt=null, allowMerge=true]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
>  Finish exchange future [startVer=AffinityTopologyVersion [topVer=36, 
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> err=null]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Finished 
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> crd=false]
> [15:10:05,703][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion 
> [topVer=36, minorTopVer=0], evt=NODE_JOINED, 
> node=979cf868-1c37-424a-9ad1-12db501f32ef]
> [15:10:08,706][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion 
> [topVer=35, minorTopVer=0]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Rebalancing scheduled [order=[statementp]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Rebalancing started [top=null, evt=NODE_JOINED, 
> node=a8be3c14-9add-48c3-b099-3fd304cfdbf4]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=2f6bde48-ffb5-4815-bd32-df4e57dc13e0, partitionsCount=18, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=35d01141-4dce-47dd-adf6-a4f3b2bb9da9, partitionsCount=15, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=b3a8be53-e61f-4023-a906-a265923837ba, partitionsCount=15, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=f825cb4e-7dcc-405f-a40d-c1dc1a3ade5a, partitionsCount=12, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=4ae1db91-8b88-4180-a84b-127a303959e9, partitionsCount=11, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=7c286481-7638-49e4-8c68-fa6aa65d8b76, partitionsCount=18, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> So, in clusters with a large amount of data and frequent client leave/join
> events, a new server will never receive its partitions.
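The log shows "Skipping rebalancing (nothing scheduled)" for the client join, meaning no partition had to move, yet the rebalance started on topVer=35 was cancelled anyway. Because client nodes do not store cache partitions, a client join should yield an empty assignment difference. The toy model below illustrates that under stated assumptions: it uses a simplified rendezvous (highest-random-weight) hash with a MurmurHash3-style mixer, which is not Ignite's RendezvousAffinityFunction, and ClientJoinAffinitySketch, assign and diff are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Toy highest-random-weight affinity; a simplification, not Ignite's RendezvousAffinityFunction. */
public class ClientJoinAffinitySketch {
    /** A cluster node; client nodes do not store partitions. */
    static final class Node {
        final String id;
        final boolean client;

        Node(String id, boolean client) {
            this.id = id;
            this.client = client;
        }
    }

    /** Pseudo-random weight of a (node, partition) pair, mixed with the MurmurHash3 fmix32 finalizer. */
    static int weight(String nodeId, int part) {
        int h = nodeId.hashCode() * 0x9E3779B9 + part;
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return h;
    }

    /** Primary owner of every partition, chosen among server nodes only. */
    static Map<Integer, String> assign(List<Node> topology, int parts) {
        Map<Integer, String> owners = new HashMap<>();
        for (int p = 0; p < parts; p++) {
            String best = null;
            int bestWeight = 0;
            for (Node n : topology) {
                if (n.client)
                    continue; // Clients never own partitions.
                int w = weight(n.id, p);
                if (best == null || w > bestWeight) {
                    best = n.id;
                    bestWeight = w;
                }
            }
            owners.put(p, best);
        }
        return owners;
    }

    /** Partitions whose owner differs between two assignments, mapped to the new owner. */
    static Map<Integer, String> diff(Map<Integer, String> before, Map<Integer, String> after) {
        Map<Integer, String> changed = new HashMap<>();
        for (Map.Entry<Integer, String> e : after.entrySet()) {
            if (!Objects.equals(e.getValue(), before.get(e.getKey())))
                changed.put(e.getKey(), e.getValue());
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Node> servers = List.of(new Node("S1", false), new Node("S2", false), new Node("S3", false));

        // NODE_JOINED for a client: the assignment difference is empty.
        List<Node> withClient = new ArrayList<>(servers);
        withClient.add(new Node("C1", true));
        System.out.println("client join moves "
            + diff(assign(servers, 16), assign(withClient, 16)).size() + " partitions"); // prints "client join moves 0 partitions"

        // NODE_JOINED for a server: only partitions now won by S4 move.
        List<Node> withServer = new ArrayList<>(withClient);
        withServer.add(new Node("S4", false));
        System.out.println("server join moves " + diff(assign(withClient, 16), assign(withServer, 16)));
    }
}
```

With an empty difference, the empty-assignment rule in the deleted comment lets the rebalance already running on the earlier version continue instead of being cancelled.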



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
