Alexei Scherbakov created IGNITE-10078:
------------------------------------------
Summary: Node failure during concurrent partition updates may
cause partition desync between primary and backup.
Key: IGNITE-10078
URL: https://issues.apache.org/jira/browse/IGNITE-10078
Project: Ignite
Issue Type: Bug
Reporter: Alexei Scherbakov
Assignee: Alexei Scherbakov
Fix For: 2.8
This can happen if an update with a lower partition counter is not written to
the WAL before the node fails, while an update with a higher counter is.
Scenario:
1. Start grid with 3 nodes, 2 backups.
2. Preload some data to partition P.
3. Start two concurrent transactions, each writing a single key to the same
partition P. The keys are different.
{noformat}
// Pessimistic, repeatable-read transaction: timeout 0 (no timeout), expected size 1 entry.
try (Transaction tx = client.transactions().txStart(PESSIMISTIC, REPEATABLE_READ, 0, 1)) {
    client.cache(DEFAULT_CACHE_NAME).put(k, v);

    tx.commit();
}
{noformat}
4. Order the updates on a backup so that the update with the highest partition
counter is written to the WAL, while the update with the lower partition counter
fails because the failure handler (FH) is triggered before it is added to the WAL.
5. Return the failed node to the grid and observe that no rebalancing occurs,
because the partition counters are the same on all owners.
Possible solution: detect gaps in update counters on recovery and, if any are
found, force a rebalance from a node without gaps.
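
The gap check could look roughly like the sketch below. It is only an illustration of the idea, not Ignite code: the class UpdateCounterGaps, its methods, and the notion of a "base counter" (the last counter known to be applied contiguously before recovery) are hypothetical names introduced here. Given the counters of the updates actually recovered from the WAL, it reports the missing ranges. A non-empty result would mean the partition must be rebalanced even though its highest counter matches the other owners.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper for illustration only; not part of the Ignite API. */
final class UpdateCounterGaps {
    private UpdateCounterGaps() {
        // Static helpers only.
    }

    /**
     * Finds missing update counters.
     *
     * @param baseCounter Last counter known to be applied without gaps (assumption).
     * @param recovered Counters of the updates recovered from the WAL, in any order.
     * @return Closed ranges {from, to} of counters that were never recovered.
     */
    static List<long[]> findGaps(long baseCounter, Collection<Long> recovered) {
        // Sort and de-duplicate, ignoring counters already covered by the base.
        TreeSet<Long> sorted = new TreeSet<>();

        for (long cntr : recovered) {
            if (cntr > baseCounter)
                sorted.add(cntr);
        }

        List<long[]> gaps = new ArrayList<>();

        long expected = baseCounter + 1;

        for (long cntr : sorted) {
            // Everything between the expected counter and this one is missing.
            if (cntr > expected)
                gaps.add(new long[] {expected, cntr - 1});

            expected = cntr + 1;
        }

        return gaps;
    }

    /** A partition with at least one gap cannot be trusted, whatever its highest counter is. */
    static boolean needsRebalance(long baseCounter, Collection<Long> recovered) {
        return !findGaps(baseCounter, recovered).isEmpty();
    }
}
```

For the scenario above, assume the preload left the partition at counter 10 and the two transactions were assigned counters 11 and 12. If only the update with counter 12 reached the WAL on the backup, findGaps(10, [12]) returns the single range {11, 11}, so the node would request a rebalance instead of being treated as up to date.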