[
https://issues.apache.org/jira/browse/IGNITE-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601825#comment-17601825
]
Semyon Danilov commented on IGNITE-17611:
-----------------------------------------
Looks good to me!
> Implement proper local storage recovery for transaction state store
> -------------------------------------------------------------------
>
> Key: IGNITE-17611
> URL: https://issues.apache.org/jira/browse/IGNITE-17611
> Project: Ignite
> Issue Type: Improvement
> Reporter: Ivan Bessonov
> Assignee: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> h3. Preliminaries
> The current design expects transaction states to be replicated using the same
> RAFT groups that process partition transactional data. In code this means
> that there are two physical storages associated with a single state machine.
> This design is easy to achieve when the system is stable, but fault tolerance
> and basic node restart might introduce some complications.
> h3. Partition storage design
> By itself, partition storage works this way:
> * every write command writes the value of the RAFT log index associated with
> the command;
> * this index value is written atomically with the data from the command;
> * updates are accumulated in the memory buffer before being written to disk;
> * upon restart, we read the value of the last applied index and resume the
> recovery process from it. This is done with the RAFT snapshot infrastructure.
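
The partition storage scheme above can be modelled with a small in-memory sketch. Everything here is hypothetical and only for illustration: the class SketchPartitionStorage, its methods, and the use of plain maps in place of a real on-disk store are assumptions, not the Ignite storage API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the scheme above; not the Ignite storage API.
final class SketchPartitionStorage {
    // "Disk" state: the data and the applied index only ever change together.
    private Map<String, String> diskData = new HashMap<>();
    private long diskAppliedIndex = 0L;

    // In-memory buffer of updates that have not been flushed yet.
    private final Map<String, String> buffer = new HashMap<>();
    private long bufferedIndex = 0L;

    // Every write command records the RAFT log index together with its data.
    void write(long raftIndex, String key, String value) {
        buffer.put(key, value);
        bufferedIndex = raftIndex;
    }

    // Flush moves the buffered data and the matching index to "disk" in one step.
    void flush() {
        Map<String, String> merged = new HashMap<>(diskData);
        merged.putAll(buffer);
        diskData = merged;
        diskAppliedIndex = bufferedIndex;
        buffer.clear();
    }

    // Simulated restart: unflushed updates are lost, and the persisted index
    // tells the caller where log replay has to resume.
    long restart() {
        buffer.clear();
        bufferedIndex = diskAppliedIndex;
        return diskAppliedIndex;
    }

    // Reads see buffered updates first, then flushed data.
    String read(String key) {
        String buffered = buffer.get(key);
        return buffered != null ? buffered : diskData.get(key);
    }
}
```

In this model, a write at index 2 that was never flushed disappears on restart, and restart() reports index 1, so the command at index 2 would be replayed from the RAFT log.
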
> h3. Changes to tx state store
> Basically, everything has to be repeated:
> * applied index value must be introduced to tx state storage;
> * updates must be atomic;
> * on restart, we should use the minimum of the last applied index values from
> both the TX state and MvPartition storages ({{PartitionSnapshotStorage}} has
> to be changed).
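
A minimal sketch of the restart rule in the last bullet, with hypothetical names (RecoveryIndex, recoveryStartIndex, shouldApply) that do not come from the Ignite code base. The shouldApply check is an assumption on my part: since replay starts from the smaller index, the storage that is ahead would presumably need to skip commands it has already applied.

```java
// Hypothetical helper; names are illustrative and not part of Ignite.
final class RecoveryIndex {
    private RecoveryIndex() {
    }

    // Replay must start from the storage that is furthest behind,
    // so the recovery index is the minimum of the two applied indexes.
    static long recoveryStartIndex(long mvPartitionAppliedIndex, long txStateAppliedIndex) {
        return Math.min(mvPartitionAppliedIndex, txStateAppliedIndex);
    }

    // Assumption: a storage skips replayed commands at or below its own applied index.
    static boolean shouldApply(long commandIndex, long storageAppliedIndex) {
        return commandIndex > storageAppliedIndex;
    }
}
```

For example, with the partition storage at index 10 and the tx state storage at index 7, replay would start after index 7, and the partition storage would ignore commands 8 through 10.
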
> h3. Other necessary changes
> * atomic flush must be set up for the tx state storage. WAL should be
> disabled;
> * snapshot command must trigger the flush. Please refer to
> {{RocksDbFlushListener}} and {{RocksDbMvPartitionStorage#flush}} for
> implementation reference. Listener class can be generified and reused;
> * assertion in {{PartitionListener#onWrite}} should be removed or
> drastically improved;
> * read operations on the storages must be prohibited until local recovery is
> completed - we should apply all commands up to the "commitIndex" value that was
> read at the start of the node, otherwise the storages may contain data that is
> inconsistent with each other.
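
The read restriction in the last bullet could be sketched as a simple gate. RecoveryGate and its methods are hypothetical names used for illustration only; how the actual storages would expose or enforce this is not specified in the issue.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical gate that rejects reads until local recovery has caught up.
final class RecoveryGate {
    // The "commitIndex" value read when the node started.
    private final long commitIndexAtStart;

    // Highest RAFT log index applied locally so far.
    private final AtomicLong lastApplied;

    RecoveryGate(long commitIndexAtStart, long lastAppliedAtStart) {
        this.commitIndexAtStart = commitIndexAtStart;
        this.lastApplied = new AtomicLong(lastAppliedAtStart);
    }

    // Called after each replayed command has been applied to the storages.
    void onCommandApplied(long index) {
        lastApplied.accumulateAndGet(index, Math::max);
    }

    boolean recovered() {
        return lastApplied.get() >= commitIndexAtStart;
    }

    // Runs the read only once every command up to commitIndexAtStart is applied.
    <T> T read(Supplier<T> reader) {
        if (!recovered()) {
            throw new IllegalStateException("Local recovery is not completed yet");
        }
        return reader.get();
    }
}
```

In this sketch a read issued before the applied index reaches the start-time commitIndex fails fast; blocking the caller until recovery finishes would be an equally plausible design.
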