[
https://issues.apache.org/jira/browse/IGNITE-20124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876740#comment-17876740
]
Roman Puchkovskiy commented on IGNITE-20124:
--------------------------------------------
The patch looks good to me
> Prevent double storage updates within primary
> ---------------------------------------------
>
> Key: IGNITE-20124
> URL: https://issues.apache.org/jira/browse/IGNITE-20124
> Project: Ignite
> Issue Type: Improvement
> Reporter: Alexander Lapin
> Assignee: Alexander Lapin
> Priority: Major
> Labels: ignite-3, transactions
> Time Spent: 4h 50m
> Remaining Estimate: 0h
>
> h3. Motivation
> In order to preserve the guarantee that the primary replica is always
> up-to-date it's required to:
> * In case of common RW transaction - insert writeIntent to the storage
> within primary before replication.
> * In case of one-phase-commit - insert commitedWrite after the replication.
> Both have already been done. However, that means that if primary is part of
> the replication group, and it's true in almost all cases, we will double the
> update:
> * In case of common RW transaction - through the replication.
> * In case of one-phase-commit - either through the replication, or though
> post update, if replication was fast enough.
> h3. Definition of Done
> * Prevent double storage updates within primary.
> h3. Implementation Notes
> The easiest way to prevent double insert is to skip one if local safe time is
> greater or equal to candidates. There are 3 places where we update partition
> storage:
> # Primary pre-replication update. In that case, the second update on
> replication should be excluded.
> # Primary post-replication update in case of 1PC. It's possible to see
> already updated data if replication was already processed locally. It is
> expected to be already covered in
> https://issues.apache.org/jira/browse/IGNITE-15927 . We should check the
> primary safe time on post-replication update and don't do update if the safe
> time is already adjusted.
> # Insert through replication. In case of !1PC on every primary there will be
> double insert (see 1). In case of 1PC it depends, so we should check the safe
> time on primary to know whether the update should be done (see 2).
> In every case, the storage indexes still should be adjusted on replication,
> as it is done now, because the progress of indexes on FSM write operations
> should not be violated - otherwise, a Raft snapshot-based rebalance would be
> broken. We may have two non-consistent storage updates on primary which may
> affect different fsyncs, but the transactional correctness isn't violated by
> these non-consistent storage updates, because there is only a possibility
> that some writes or write intents will go ahead of indexes and therefore will
> be included into snapshots - however we still can process such writes and
> resolve write intents.
> Also, the safe time needs to be updated on the primary replica now. There can
> be following scenarios:
> # Two-phase commit: we can advance safe time on primary, make
> pre-replication update and then run Raft command. Both safe time adjustment
> and storage update happen before replication.
> # One-phase commit: safe time should be advanced after completeness of Raft
> command future. There is no happens-before between the future callback and
> the replication handler, so the safe time should be checked and advanced in
> both places. We should use some critical section, preventing race between
> safe time check, safe time adjustment and storage update. For this purpose we
> can use row locks that are taken inside of {{{}storage#runConsistently{}}}.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)