[ 
https://issues.apache.org/jira/browse/IGNITE-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833971#comment-16833971
 ] 

Alexei Scherbakov commented on IGNITE-10078:
--------------------------------------------

[~agoncharuk] [~Pavlukhin]

I have finally fixed all remaining issues with failing tests. The remaining 
blocker is not related to my change (it is permanently broken in master).

All comments above are addressed.

[~Pavlukhin] I don't entirely agree with the _pending update_ terminology, 
because pending implies the update has not happened yet. In fact the update did 
happen, just out of order, so I will stick with the current naming. _Gaps_ is 
fine.

Just to clarify, this is an attempt to fix the most severe issues with full and 
historical rebalancing that lead to partition desync. A number of related 
tickets were created and need to be addressed once this contribution is 
accepted.

Below is a brief description of the major changes introduced by this contribution:
 # In addition to the partition update counter, a _reservation counter_ was 
introduced. It is used on the primary node to handle the scenario where a 
commit happens first on a backup and only later on the primary node, for 
example with one-phase commit. Because a counter may only be incremented once 
the update is written to the WAL, we need some kind of _high watermark_ (HWM) 
for tracking pending (not yet applied) updates; otherwise, on primary node 
failure we might end up with a wrong counter. The reservation counter is 
incremented only on the primary node and is synchronized between partition 
owners on PME. The old update counter serves as the _low watermark_ (LWM), 
pointing to the upper bound of sequential updates.
 # {{WALHistoricalIterator}} is fixed.
 # Introduced {{RollbackRecord}} to correctly track the state of partition 
updates: missed updates have no corresponding {{RollbackRecord}}, while 
rolled-back transactions (rolled back manually or by tx recovery) produce a 
proper {{RollbackRecord}}. It is then used by the historical iterator.
 # Gaps in the update sequence are now persisted between checkpoints. This is 
necessary to determine the correct update counter (LWM) for rebalancing.
 # Implemented a way to store arbitrary metadata in the partition file. A new 
freelist, {{PartitionMetaStorageImpl}}, was introduced; it is used together 
with {{CacheFreeListImpl}}.
 # Fixed several issues leading to partition desync during rebalancing, most 
notably {{GridDhtLocalPartition.rmvQueue}} overflow.
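To make the two-counter scheme in item 1 concrete, here is a toy sketch (illustrative only, not Ignite's actual implementation; all names are hypothetical): the primary reserves counters (HWM) before writing to the WAL, applied updates advance the sequential counter (LWM), and out-of-order updates are remembered as gaps until the hole before them is filled.

{noformat}
import java.util.TreeMap;

/** Toy model of the HWM/LWM counter pair described above (not Ignite code). */
class PartitionCounterSketch {
    private long lwm;      // low watermark: all updates <= lwm are applied sequentially
    private long reserved; // high watermark: highest counter handed out so far
    // Out-of-order ranges already applied: range start -> range end (exclusive).
    private final TreeMap<Long, Long> gaps = new TreeMap<>();

    /** Primary reserves a counter range before the update is written to the WAL. */
    synchronized long reserve(long delta) {
        long start = reserved;
        reserved += delta;
        return start;
    }

    /** Called once the update with counters [start, start + delta) hits the WAL. */
    synchronized void applyUpdate(long start, long delta) {
        if (start == lwm) {
            lwm += delta;
            // A previously out-of-order range may now be sequential: merge it.
            Long end;
            while ((end = gaps.remove(lwm)) != null)
                lwm = end;
        }
        else
            gaps.put(start, start + delta); // Out of order: remember the range.
    }

    synchronized boolean hasGaps() { return !gaps.isEmpty(); }
    synchronized long lwm() { return lwm; }
    synchronized long hwm() { return reserved; }
}
{noformat}

If the update holding the lower reserved counter never reaches the WAL (e.g. the node fails first), the gap survives and the LWM stops short of the HWM, which is exactly the signal that the partition needs rebalancing.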

 

> Node failure during concurrent partition updates may cause partition desync 
> between primary and backup.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-10078
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10078
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Alexei Scherbakov
>            Assignee: Alexei Scherbakov
>            Priority: Major
>             Fix For: 2.8
>
>
> This is possible if some updates are not written to the WAL before node failure. 
> They will not be applied by rebalancing, because the partition counters are the 
> same in the following scenario:
> 1. Start grid with 3 nodes, 2 backups.
> 2. Preload some data to partition P.
> 3. Start two concurrent transactions, each writing a single key to the same 
> partition P; the keys are different:
> {noformat}
> try (Transaction tx = client.transactions().txStart(PESSIMISTIC, REPEATABLE_READ, 0, 1)) {
>     client.cache(DEFAULT_CACHE_NAME).put(k, v);
>     tx.commit();
> }
> {noformat}
> 4. Order the updates on the backup such that the update with the greater 
> partition counter is written to the WAL, while the update with the lesser 
> partition counter fails (the failure handler is triggered) before it is added 
> to the WAL.
> 5. Return the failed node to the grid and observe that no rebalancing happens, 
> because the partition counters are the same.
> Possible solution: detect gaps in update counters on recovery and, if any are 
> found, force rebalancing from a node without gaps.
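The gap detection proposed in the issue could be sketched roughly like this (a hypothetical helper for illustration only; {{findGaps}} and its parameters are not Ignite API):

{noformat}
import java.util.*;

/** Toy helper: given the update counters recovered from the WAL, find the
 *  holes in the sequence. Any hole means some update was lost, so the
 *  partition must be rebalanced from an owner without gaps. */
class GapDetector {
    /** Returns the counters missing between lastCheckpointCntr and the maximum seen. */
    static List<Long> findGaps(long lastCheckpointCntr, SortedSet<Long> countersInWal) {
        List<Long> missing = new ArrayList<>();
        long expected = lastCheckpointCntr + 1;

        for (long c : countersInWal) {
            for (long m = expected; m < c; m++)
                missing.add(m); // Counter m was never written to the WAL.
            expected = c + 1;
        }
        return missing;
    }
}
{noformat}

In the scenario above, the backup's WAL would contain the higher counter but not the lower one, so recovery would report the lower counter as missing instead of silently keeping an equal update counter.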



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
