[ https://issues.apache.org/jira/browse/IGNITE-17611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Semyon Danilov updated IGNITE-17611: ------------------------------------ Reviewer: Semyon Danilov > Implement proper local storage recovery for transaction state store > ------------------------------------------------------------------- > > Key: IGNITE-17611 > URL: https://issues.apache.org/jira/browse/IGNITE-17611 > Project: Ignite > Issue Type: Improvement > Reporter: Ivan Bessonov > Assignee: Ivan Bessonov > Priority: Major > Labels: ignite-3 > Time Spent: 2h 50m > Remaining Estimate: 0h > > h3. Preliminaries > Current design expects transaction states to be replicated using the same > RAFT groups that process partition transactional data. In code this means > that there are two physical storages associated with a single state machine. > This design is easy to achieve when the system is stable, but fault tolerance > and basic node restart might introduce some complications. > h3. Partition storage design > By itself, partition storage works this way: > * every write command writes value of the RAFT log index, associated with > the command; > * this index value is written atomically with the data from the command; > * updates are accumulated in the memory buffer before being written to disk. > * upon restart, we read the value of the last applied index and proceed the > recovery process from it. It's done with RAFT snapshots infrastructure. > h3. Changes to tx state store > Basically, everything has to be repeated: > * applied index value must be introduced to tx state storage; > * updates must be atomic; > * on restart, we should use the minimal value of last applied index from > both TX State and MvPartinion storages ({{{}PartitionSnapshotStorage{}}} has > to be changed). > h3. Other necessary changes > * atomic flush must be set up for the tx state storage. WAL should be > disabled; > * snapshot command must trigger the flush. Please refer to > {{RocksDbFlushListener}} and {{RocksDbMvPartitionStorage#flush}} for > implementation reference. Listener class can be generified and reused; > * assertion in {{PartitionListener#onWrite}} should be removed or > drastically improved; > * read operation on storages must be prohibited until local recovery is > completed - we should apply all command up to "commitIndex" value that's been > read at the start of the node, otherwise storages may have data, inconsistent > with each other. -- This message was sent by Atlassian Jira (v8.20.10#820010)