Pavel Kovalenko created IGNITE-8391:
---------------------------------------
Summary: Removing some WAL history segments leads to WAL rebalance
hanging
Key: IGNITE-8391
URL: https://issues.apache.org/jira/browse/IGNITE-8391
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Fix For: 2.6
Problem:
1) Start 2 nodes and load some data into them.
2) Stop node 2 and load more data into the cache.
3) Remove an archived WAL segment that does not contain the checkpoint record needed
to find the start point for WAL rebalance, but does contain data necessary for
rebalancing.
4) Start node 2; it will start rebalancing data from node 1 using the WAL.
Rebalance hangs with the following assertion:
{noformat}
java.lang.AssertionError: Partitions after rebalance should be either done or
missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
    at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
    at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
    at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
    at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{noformat}
This happens because we never reach the necessary data and updateCounters,
which were contained in the removed WAL segment.
To resolve such problems, we should introduce a fallback strategy for the case
when WAL rebalance fails. One example of a fallback strategy is to re-run full
rebalance for the partitions that could not be properly rebalanced using WAL.
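The fallback idea can be illustrated with a minimal, self-contained sketch. It is not Ignite code: the class WalRebalanceFallback, the Plan holder, and the representation of WAL history as a set of segment indexes are all hypothetical and chosen only for illustration. Under those assumptions, a partition stays on WAL (historical) rebalance only if every segment it needs is still available; otherwise it is scheduled for full rebalance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a WAL-rebalance fallback; none of these names are Ignite APIs. */
public class WalRebalanceFallback {
    /** Result of planning: partitions that can use WAL and partitions that need full rebalance. */
    public static final class Plan {
        public final List<Integer> historical = new ArrayList<>();
        public final List<Integer> full = new ArrayList<>();
    }

    /** Returns true only if every segment index in [fromIdx, toIdx] is still present. */
    static boolean walHistoryComplete(Set<Long> availableSegments, long fromIdx, long toIdx) {
        for (long idx = fromIdx; idx <= toIdx; idx++) {
            if (!availableSegments.contains(idx))
                return false; // A removed segment may hold data or update counters we need.
        }
        return true;
    }

    /**
     * Splits partitions between WAL rebalance and full rebalance.
     *
     * @param requiredSegments Partition id mapped to {first, last} required segment index (assumed model).
     * @param availableSegments Segment indexes still present in the WAL archive (assumed model).
     */
    public static Plan plan(Map<Integer, long[]> requiredSegments, Set<Long> availableSegments) {
        Plan plan = new Plan();

        for (Map.Entry<Integer, long[]> e : requiredSegments.entrySet()) {
            long[] range = e.getValue();

            if (walHistoryComplete(availableSegments, range[0], range[1]))
                plan.historical.add(e.getKey());
            else
                plan.full.add(e.getKey()); // Fallback: re-run full rebalance for this partition.
        }

        return plan;
    }
}
```

In this sketch the decision is made up front from segment availability. A real fix might instead detect the failure while iterating the WAL and then switch the affected partitions to full rebalance; which of the two fits Ignite better is left open here.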