[ 
https://issues.apache.org/jira/browse/IGNITE-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Kovalenko updated IGNITE-8391:
------------------------------------
    Fix Version/s:     (was: 2.7)
                   2.8

> Removing some WAL history segments leads to WAL rebalance hanging
> -----------------------------------------------------------------
>
>                 Key: IGNITE-8391
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8391
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.4
>            Reporter: Pavel Kovalenko
>            Priority: Major
>             Fix For: 2.8
>
>
> Problem:
> 1) Start 2 nodes and load some data into a cache.
> 2) Stop node 2 and load more data into the cache.
> 3) Remove an archived WAL segment that doesn't contain the Checkpoint record 
> needed to find the start point for WAL rebalance, but does contain data 
> necessary for rebalancing.
> 4) Start node 2. This node will start rebalancing data from node 1 using WAL.
> Rebalance hangs with the following assertion:
> {noformat}
> java.lang.AssertionError: Partitions after rebalance should be either done or 
> missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
>       at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417)
>       at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
>       at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
>       at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>       at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
>       at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
>       at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
>       at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
>       at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
>       at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
>       at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
>       at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
>       at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
>       at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}
>  
> This happens because the supplier never reaches the necessary data and 
> updateCounters, which were contained in the removed WAL segment.
> To resolve such problems, we should introduce a fallback strategy for when 
> WAL rebalance fails. One example of a fallback strategy is to re-run a full 
> rebalance for the partitions that could not be properly rebalanced using 
> WAL.
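
The fallback idea described above could be sketched roughly as follows. This is a simplified, self-contained model and not Ignite code: the class WalRebalanceFallback, the types WalUpdate and Plan, and the method plan are all hypothetical names, and the sketch assumes that update counters for a partition grow by exactly one per update. It replays only the WAL updates that are still readable and sends any partition whose counters cannot be reached without a gap to full rebalance, instead of failing the assertion.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Simplified, hypothetical model of the fallback; none of these types are Ignite classes. */
final class WalRebalanceFallback {
    /** A single WAL data record: the partition it belongs to and its update counter. */
    static final class WalUpdate {
        final int partition;
        final long counter;

        WalUpdate(int partition, long counter) {
            this.partition = partition;
            this.counter = counter;
        }
    }

    /** Planning result: partitions that can use WAL and partitions that need full rebalance. */
    static final class Plan {
        final Set<Integer> historical = new TreeSet<>();
        final Set<Integer> full = new TreeSet<>();
    }

    /**
     * @param demanderCntrs Last update counter the restarted node has, per partition.
     * @param supplierCntrs Current update counter on the supplier, per partition.
     * @param survivingWal Updates still readable from the supplier's WAL, in log order.
     */
    static Plan plan(Map<Integer, Long> demanderCntrs, Map<Integer, Long> supplierCntrs,
        List<WalUpdate> survivingWal) {
        // Counter each partition would reach if only the surviving WAL were replayed.
        Map<Integer, Long> reached = new HashMap<>(demanderCntrs);

        for (WalUpdate upd : survivingWal) {
            Long cur = reached.get(upd.partition);

            // Advance only on the next expected counter (assumed to be cur + 1).
            // A gap means the update lived in a removed segment, so the counter stalls.
            if (cur != null && upd.counter == cur + 1)
                reached.put(upd.partition, upd.counter);
        }

        Plan plan = new Plan();

        for (Map.Entry<Integer, Long> e : supplierCntrs.entrySet()) {
            Long got = reached.get(e.getKey());

            if (got != null && got.equals(e.getValue()))
                plan.historical.add(e.getKey()); // WAL alone brings the partition up to date.
            else
                plan.full.add(e.getKey()); // Fallback: schedule full rebalance for this partition.
        }

        return plan;
    }
}
```

In this model, a partition whose updates were partly stored in the removed segment ends up in plan.full, while partitions with an unbroken counter sequence stay in plan.historical. How the real supplier would detect the gap and signal it to the demander is outside the scope of this sketch.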



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
