[ https://issues.apache.org/jira/browse/IGNITE-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pavel Kovalenko updated IGNITE-8391:
------------------------------------
    Fix Version/s:     (was: 2.7)
                   2.8

> Removing some WAL history segments leads to WAL rebalance hanging
> -----------------------------------------------------------------
>
>                 Key: IGNITE-8391
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8391
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.4
>            Reporter: Pavel Kovalenko
>            Priority: Major
>             Fix For: 2.8
>
>
> Problem:
> 1) Start 2 nodes and load some data into them.
> 2) Stop node 2, then load some data into the cache.
> 3) Remove an archived WAL segment that does not contain the checkpoint record
> needed to find the start point for WAL rebalance, but does contain data
> necessary for rebalancing.
> 4) Start node 2; this node will start rebalancing data from node 1 using WAL.
> Rebalance hangs with the following assertion:
> {noformat}
> java.lang.AssertionError: Partitions after rebalance should be either done or
> missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
>       at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417)
>       at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
>       at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
>       at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
>       at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
>       at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
>       at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
>       at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
>       at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
>       at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
>       at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
>       at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
>       at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
>       at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}
>
> This happens because the necessary data and updateCounters were contained in
> the removed WAL segment, so the supplier never reaches them.
> To resolve such problems we should introduce a fallback strategy for the case
> where WAL rebalance fails. One example of such a strategy is to re-run a full
> rebalance for the partitions that could not be properly rebalanced using WAL.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
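
The fallback strategy proposed in the issue could be sketched roughly as follows. This is only an illustration of the idea, not Ignite code: the class `WalRebalanceFallback`, the `Plan` holder, and the `walRebalance` callback are hypothetical names, and the sketch assumes a WAL rebalance attempt can report per-partition success or failure instead of tripping an assertion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical sketch of a WAL-rebalance fallback; none of these names are Ignite APIs. */
final class WalRebalanceFallback {
    /** Outcome of one rebalance round, split by how each partition must be handled. */
    static final class Plan {
        /** Partitions that were fully supplied from WAL history. */
        final List<Integer> historical = new ArrayList<>();
        /** Partitions whose WAL rebalance failed and that need a full rebalance. */
        final List<Integer> full = new ArrayList<>();
    }

    /**
     * Tries WAL rebalance for every partition and collects the failures for a full rebalance.
     *
     * @param partitions   partition ids demanded for WAL rebalance
     * @param walRebalance assumed callback: returns true if the partition was completely
     *                     supplied from WAL; returns false or throws if history is missing
     */
    static Plan rebalance(List<Integer> partitions, IntPredicate walRebalance) {
        Plan plan = new Plan();

        for (int part : partitions) {
            boolean done;

            try {
                done = walRebalance.test(part);
            } catch (RuntimeException e) {
                // Treat a failed WAL iteration (e.g. a missing segment) as "not rebalanced".
                done = false;
            }

            if (done)
                plan.historical.add(part);
            else
                plan.full.add(part); // Fallback: schedule a full rebalance instead of hanging.
        }

        return plan;
    }
}
```

A caller would then request a full rebalance for `plan.full` rather than asserting that every partition is either done or missing.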