[ 
https://issues.apache.org/jira/browse/IGNITE-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semyon Danilov updated IGNITE-17611:
------------------------------------
    Reviewer: Semyon Danilov

> Implement proper local storage recovery for transaction state store
> -------------------------------------------------------------------
>
>                 Key: IGNITE-17611
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17611
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Assignee: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> h3. Preliminaries
> The current design expects transaction states to be replicated using the same 
> RAFT groups that process partition transactional data. In code, this means 
> that two physical storages are associated with a single state machine. 
> This design is easy to implement when the system is stable, but fault tolerance 
> and even a basic node restart introduce complications.
> h3. Partition storage design
> By itself, partition storage works this way:
>  * every write command writes the RAFT log index associated with the 
> command;
>  * this index value is written atomically with the data from the command;
>  * updates are accumulated in a memory buffer before being written to disk;
>  * upon restart, we read the value of the last applied index and resume the 
> recovery process from it. This is done with the RAFT snapshots infrastructure.
> h3. Changes to tx state store
> Basically, everything has to be repeated:
>  * applied index value must be introduced to tx state storage;
>  * updates must be atomic;
>  * on restart, we should use the minimal value of the last applied index from 
> both the TX State and MvPartition storages ({{PartitionSnapshotStorage}} has 
> to be changed).
> h3. Other necessary changes
>  * atomic flush must be set up for the tx state storage. WAL should be 
> disabled;
>  * snapshot command must trigger the flush. Please refer to 
> {{RocksDbFlushListener}} and {{RocksDbMvPartitionStorage#flush}} for 
> implementation reference. Listener class can be generified and reused;
>  * assertion in {{PartitionListener#onWrite}} should be removed or 
> drastically improved;
>  * read operations on the storages must be prohibited until local recovery is 
> completed: we should apply all commands up to the "commitIndex" value that was 
> read at node start, otherwise the storages may contain data that is 
> inconsistent with each other.
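
The restart rule from "Changes to tx state store" (replay from the minimal last applied index of the two storages) can be sketched roughly as follows. This is only an illustration under stated assumptions: IndexedStorage, InMemoryStorage, recoveryStartIndex and replay are hypothetical names, not Ignite classes, and the sketch simplifies by sending every replayed command to both storages.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these types exist in Ignite. */
final class RecoveryIndexSketch {
    /** Minimal view of a storage that persists its last applied RAFT index. */
    interface IndexedStorage {
        long lastAppliedIndex();

        void apply(long index, String command);
    }

    /** In-memory stand-in that records the commands applied after restart. */
    static final class InMemoryStorage implements IndexedStorage {
        private long lastApplied;
        final List<String> applied = new ArrayList<>();

        InMemoryStorage(long lastApplied) {
            this.lastApplied = lastApplied;
        }

        @Override
        public long lastAppliedIndex() {
            return lastApplied;
        }

        @Override
        public void apply(long index, String command) {
            // Skip commands this storage had already persisted before the restart.
            if (index <= lastApplied) {
                return;
            }
            applied.add(command);
            lastApplied = index; // in a real storage: same atomic write as the data
        }
    }

    /** Replay must start right after the smaller of the two persisted indexes. */
    static long recoveryStartIndex(IndexedStorage mv, IndexedStorage tx) {
        return Math.min(mv.lastAppliedIndex(), tx.lastAppliedIndex());
    }

    /** Replays log entries (RAFT index = list position + 1) into both storages. */
    static void replay(List<String> log, IndexedStorage mv, IndexedStorage tx) {
        long start = recoveryStartIndex(mv, tx);
        for (long i = start + 1; i <= log.size(); i++) {
            String command = log.get((int) (i - 1));
            mv.apply(i, command);
            tx.apply(i, command);
        }
    }

    public static void main(String[] args) {
        InMemoryStorage mv = new InMemoryStorage(5); // partition data flushed up to 5
        InMemoryStorage tx = new InMemoryStorage(3); // tx states flushed only up to 3
        replay(List.of("c1", "c2", "c3", "c4", "c5", "c6"), mv, tx);
        System.out.println("mv replayed " + mv.applied + ", tx replayed " + tx.applied);
    }
}
```

The storage that is ahead ignores commands it already holds, so starting from the minimum is safe for both and loses nothing for the one that lags behind.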

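For the last item of "Other necessary changes", one possible way to keep reads away from the storages until local recovery is done is a small gate that opens once the "commitIndex" read at node start has been applied. The sketch below is an assumption about how this might look, not the actual implementation: RecoveryGateSketch and its methods are hypothetical, and it defers reads rather than rejecting them, which is only one reading of "prohibited".

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical sketch; not an Ignite class. */
final class RecoveryGateSketch {
    /** The "commitIndex" value read when the node started. */
    private final long commitIndexAtStart;

    /** Completed once every command up to commitIndexAtStart has been applied. */
    private final CompletableFuture<Void> recovered = new CompletableFuture<>();

    RecoveryGateSketch(long commitIndexAtStart) {
        this.commitIndexAtStart = commitIndexAtStart;
        if (commitIndexAtStart <= 0) {
            recovered.complete(null); // empty log: nothing to replay
        }
    }

    /** To be called after a replayed command has been applied to both storages. */
    void onApplied(long index) {
        if (index >= commitIndexAtStart) {
            recovered.complete(null);
        }
    }

    boolean isRecovered() {
        return recovered.isDone();
    }

    /** Defers the read until recovery has finished; the future is pending until then. */
    <T> CompletableFuture<T> read(Supplier<T> reader) {
        return recovered.thenApply(ignored -> reader.get());
    }
}
```

A caller would wrap storage reads as gate.read(() -> storage.get(key)), where storage and key are again placeholders; the returned future completes only after onApplied has seen the start-time commit index.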


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
