[jira] [Updated] (IGNITE-10078) Node failure during concurrent partition updates may cause partition desync between primary and backup.

Alexei Scherbakov (JIRA) Wed, 31 Oct 2018 00:27:24 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alexei Scherbakov updated IGNITE-10078:
---------------------------------------
    Description: 
This can happen if some updates are not written to the WAL before a node fails.
Rebalancing will not apply them afterwards, because the partition counters on
primary and backup are the same, in the following scenario:

1. Start grid with 3 nodes, 2 backups.
2. Preload some data to partition P.
3. Start two concurrent transactions, each writing a single key to the same
partition; the keys are different.
{noformat}
// Pessimistic, repeatable-read transaction: timeout 0, expected size 1 entry.
try (Transaction tx = client.transactions().txStart(PESSIMISTIC,
        REPEATABLE_READ, 0, 1)) {
    client.cache(DEFAULT_CACHE_NAME).put(k, v);

    tx.commit();
}
{noformat}
4. Order the updates on a backup so that the update with the highest partition
counter is written to the WAL, while the update with the lower partition
counter fails because the failure handler (FH) is triggered before that update
is added to the WAL.

5. Return the failed node to the grid and observe that no rebalancing occurs,
because the partition counters are the same.
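
To make the failure mode concrete, here is a minimal Java sketch of the
scenario above. It is a simplified model, not Ignite code: the class
CounterComparisonDemo and its methods are hypothetical. It assumes that
recovery restores the highest update counter found in the WAL and that
rebalancing is requested only when the primary and backup counters differ.

```java
/** Simplified model of the scenario; hypothetical names, not the Ignite API. */
final class CounterComparisonDemo {
    /** Assumption: recovery restores the highest update counter seen in the WAL. */
    static long recoveredCounter(long preloadedCounter, long... walCounters) {
        long max = preloadedCounter;

        for (long c : walCounters)
            max = Math.max(max, c);

        return max;
    }

    /** Assumption: rebalancing is requested only if the two counters differ. */
    static boolean rebalanceRequired(long primaryCounter, long backupCounter) {
        return primaryCounter != backupCounter;
    }

    public static void main(String[] args) {
        // Preloading left the counter at 10; the two transactions got 11 and 12.
        long primary = 12L;

        // The backup logged update 12, but update 11 never reached the WAL.
        long backup = recoveredCounter(10L, 12L);

        // Prints "false": the counters match, so the lost update is never resent.
        System.out.println(rebalanceRequired(primary, backup));
    }
}
```

In this model the backup reports counter 12 even though update 11 is missing,
which is why comparing the counters alone cannot reveal the desync.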

Possible solution: detect gaps in update counters on recovery and, if any are
found, force a rebalance from a node without gaps.
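
The gap check proposed above could look roughly like the following Java
sketch. It is only an illustration under stated assumptions, not the actual
fix: UpdateCounterGaps, findGaps and needsForcedRebalance are hypothetical
names, and the sketch assumes that the update counters recovered from the WAL
for one partition are available as a collection, with counters expected to be
consecutive after the last known consistent value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of the Ignite API. */
final class UpdateCounterGaps {
    /**
     * Returns the counters missing between lastConsistent (exclusive) and the
     * highest recovered counter (inclusive), in ascending order.
     */
    static List<Long> findGaps(long lastConsistent, Iterable<Long> recovered) {
        // Sort and de-duplicate, ignoring counters at or below the known-good point.
        TreeSet<Long> sorted = new TreeSet<>();

        for (long c : recovered) {
            if (c > lastConsistent)
                sorted.add(c);
        }

        List<Long> gaps = new ArrayList<>();
        long expected = lastConsistent + 1;

        for (long c : sorted) {
            // Every counter skipped between 'expected' and 'c' is a lost update.
            for (long missing = expected; missing < c; missing++)
                gaps.add(missing);

            expected = c + 1;
        }

        return gaps;
    }

    /** A partition with at least one gap should be rebalanced from a gap-free node. */
    static boolean needsForcedRebalance(long lastConsistent, Iterable<Long> recovered) {
        return !findGaps(lastConsistent, recovered).isEmpty();
    }
}
```

For the scenario in this issue, with a preloaded counter of 10 and only update
12 recovered from the WAL, findGaps(10, [12]) reports counter 11 as missing, so
the node would request a rebalance even though its highest counter matches the
primary.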


  was:
This is possible if some updates with a lower partition counter are not written
to the WAL before node failure.

Scenario:

1. Start grid with 3 nodes, 2 backups.
2. Preload some data to partition P.
3. Start two concurrent transactions, each writing a single key to the same
partition; the keys are different.
{noformat}
try(Transaction tx = client.transactions().txStart(PESSIMISTIC, 
REPEATABLE_READ, 0, 1)) {
      client.cache(DEFAULT_CACHE_NAME).put(k, v);

      tx.commit();
}
{noformat}
4. Order the updates on a backup so that the update with the highest partition
counter is written to the WAL, while the update with the lower partition
counter fails because the failure handler (FH) is triggered before that update
is added to the WAL.

5. Return the failed node to the grid and observe that no rebalancing occurs,
because the partition counters are the same.

Possible solution: detect gaps in update counters on recovery and, if any are
found, force a rebalance from a node without gaps.



> Node failure during concurrent partition updates may cause partition desync 
> between primary and backup.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-10078
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10078
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Alexei Scherbakov
>            Assignee: Alexei Scherbakov
>            Priority: Major
>             Fix For: 2.8
>
>
> This can happen if some updates are not written to the WAL before a node
> fails. Rebalancing will not apply them afterwards, because the partition
> counters on primary and backup are the same, in the following scenario:
> 1. Start grid with 3 nodes, 2 backups.
> 2. Preload some data to partition P.
> 3. Start two concurrent transactions, each writing a single key to the same
> partition; the keys are different.
> {noformat}
> try(Transaction tx = client.transactions().txStart(PESSIMISTIC, 
> REPEATABLE_READ, 0, 1)) {
>       client.cache(DEFAULT_CACHE_NAME).put(k, v);
>       tx.commit();
> }
> {noformat}
> 4. Order the updates on a backup so that the update with the highest
> partition counter is written to the WAL, while the update with the lower
> partition counter fails because the failure handler (FH) is triggered before
> that update is added to the WAL.
> 5. Return the failed node to the grid and observe that no rebalancing occurs,
> because the partition counters are the same.
> Possible solution: detect gaps in update counters on recovery and, if any are
> found, force a rebalance from a node without gaps.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
