[ https://issues.apache.org/jira/browse/IGNITE-25623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Roman Puchkovskiy updated IGNITE-25623:
---------------------------------------
Description:
When evicting a partition replica from a node, we destroy its storages:
* MV partition storage
* TX state storage
* Raft node storages
Destruction might fail midway. On node recovery, we need to find all partition
replicas whose destruction was not finished and destroy them. The idea is that
all storages will have methods to find non-destroyed partitions; we'll use this
information to find the partitions we need to destroy (those that, according to
the assignments, should not be present on the node but still are).
* MvTableStorage will need the following method: Set<PartitionId>
partitionIdsOnDisk(). Take a look at tableIdsOnDisk() for reference.
** Pagemem-based storage will simply scan its FS directory
** RocksDB-based storage will scan the meta column family
* TxStateStorage will need the following method: Set<PartitionId>
partitionIdsOnDisk(), which works the same way as the RocksDB-based storage
above
* For Raft-based storages, a scanning method added in IGNITE-25621 could be
used
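A minimal sketch of the directory-scanning variant of partitionIdsOnDisk() could look like the following. The on-disk layout used here (table-<tableId>/part-<partitionIndex>.bin), the PartitionId record and the class name are assumptions made purely for illustration; the real storage layout and types may differ.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not an actual Ignite class. */
class PartitionDirScanSketch {
    /** Hypothetical partition identifier: table ID plus partition index. */
    record PartitionId(int tableId, int partitionIndex) {}

    // Assumed naming scheme: <root>/table-<tableId>/part-<partitionIndex>.bin
    private static final Pattern TABLE_DIR = Pattern.compile("table-(\\d{1,9})");
    private static final Pattern PART_FILE = Pattern.compile("part-(\\d{1,9})\\.bin");

    /** Returns IDs of all partitions that still have a file under the given root. */
    static Set<PartitionId> partitionIdsOnDisk(Path root) throws IOException {
        Set<PartitionId> ids = new HashSet<>();

        // Only directories directly under the root can be table directories.
        try (DirectoryStream<Path> tableDirs = Files.newDirectoryStream(root, Files::isDirectory)) {
            for (Path tableDir : tableDirs) {
                Matcher tableMatcher = TABLE_DIR.matcher(tableDir.getFileName().toString());
                if (!tableMatcher.matches()) {
                    continue; // Not a table directory, skip it.
                }
                int tableId = Integer.parseInt(tableMatcher.group(1));

                try (DirectoryStream<Path> files = Files.newDirectoryStream(tableDir)) {
                    for (Path file : files) {
                        Matcher partMatcher = PART_FILE.matcher(file.getFileName().toString());
                        if (partMatcher.matches()) {
                            ids.add(new PartitionId(tableId, Integer.parseInt(partMatcher.group(1))));
                        }
                    }
                }
            }
        }
        return ids;
    }
}
```

Files that do not match the assumed naming scheme are ignored, so unrelated files in the storage directory do not produce spurious partition IDs.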
More precisely, on node recovery, we should do the following:
# expectedPartitionIds = findExpectedPartitionIds() (this finds IDs of all
partitions whose replicas are expected to exist on this node)
# For each on-disk partitionId in StorageEngine (across all StorageEngines),
if partitionId is not in expectedPartitionIds, its storage is destroyed (we
will need to add a method to destroy just one partition MV storage to
StorageEngine)
# For each on-disk partitionId in TxStateRocksDbSharedStorage, if partitionId
is not in expectedPartitionIds, its storage is destroyed (we will need to add a
method to destroy just one partition TX storage to TxStateRocksDbSharedStorage)
# For each on-disk partitionId in ReplicaManager, if partitionId is not in
expectedPartitionIds, its storage is destroyed
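The recovery steps above can be sketched roughly as follows. This is an illustration only, not the actual Ignite code: PartitionStore, destroyPartition() and InMemoryStore are hypothetical names standing in for StorageEngine, TxStateRocksDbSharedStorage and ReplicaManager, and PartitionId is modelled as a simple record.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; none of these types are actual Ignite classes. */
class EvictedPartitionCleanupSketch {
    /** Hypothetical partition identifier: table ID plus partition index. */
    record PartitionId(int tableId, int partitionIndex) {}

    /** Hypothetical common view over the MV, TX state and Raft storages. */
    interface PartitionStore {
        Set<PartitionId> partitionIdsOnDisk();

        void destroyPartition(PartitionId id);
    }

    /**
     * Destroys every on-disk partition that is not expected on this node.
     * Returns the IDs that were destroyed (one entry per store that held them).
     */
    static List<PartitionId> destroyUnexpected(
            Set<PartitionId> expectedPartitionIds,
            List<? extends PartitionStore> stores) {
        List<PartitionId> destroyed = new ArrayList<>();
        for (PartitionStore store : stores) {
            // Snapshot the IDs so that destruction cannot disturb the iteration.
            for (PartitionId id : new ArrayList<>(store.partitionIdsOnDisk())) {
                if (!expectedPartitionIds.contains(id)) {
                    store.destroyPartition(id);
                    destroyed.add(id);
                }
            }
        }
        return destroyed;
    }

    /** In-memory stand-in for a real storage, used only to exercise the sketch. */
    static final class InMemoryStore implements PartitionStore {
        private final Set<PartitionId> onDisk = new LinkedHashSet<>();

        InMemoryStore(PartitionId... ids) {
            onDisk.addAll(List.of(ids));
        }

        @Override
        public Set<PartitionId> partitionIdsOnDisk() {
            return Set.copyOf(onDisk);
        }

        @Override
        public void destroyPartition(PartitionId id) {
            onDisk.remove(id);
        }
    }
}
```

Because each store is swept independently against the same expected set, a partition whose destruction failed after only some of its storages were removed is still cleaned up in the remaining ones.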
findExpectedPartitionIds() should, conceptually, return the union of two sets:
partitions for which the current node is in the partition's stable assignments
and partitions for which the current node is in the partition's pending
assignments. In reality, this is calculated in a slightly more involved way
(taking disaster recovery into account).
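The conceptual union can be sketched as below. This deliberately ignores the disaster recovery handling mentioned above. Representing assignments as maps from partition ID to node names, as well as the class and parameter names, are assumptions made for the example rather than the real Ignite data model.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; not an actual Ignite class. */
class ExpectedPartitionsSketch {
    /** Hypothetical partition identifier: table ID plus partition index. */
    record PartitionId(int tableId, int partitionIndex) {}

    /**
     * Returns partitions that have the local node in their stable assignments,
     * in their pending assignments, or in both.
     */
    static Set<PartitionId> findExpectedPartitionIds(
            String localNode,
            Map<PartitionId, Set<String>> stableAssignments,
            Map<PartitionId, Set<String>> pendingAssignments) {
        Set<PartitionId> expected = new HashSet<>();
        collect(localNode, stableAssignments, expected);
        collect(localNode, pendingAssignments, expected);
        return expected;
    }

    /** Adds to 'out' every partition whose assignment set contains the local node. */
    private static void collect(
            String localNode,
            Map<PartitionId, Set<String>> assignments,
            Set<PartitionId> out) {
        assignments.forEach((partitionId, nodes) -> {
            if (nodes.contains(localNode)) {
                out.add(partitionId);
            }
        });
    }
}
```

Including pending assignments matters: a replica that is still being brought up on the node must not be treated as a leftover of an unfinished destruction.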
was:
When evicting a partition replica from a node, we destroy its storages:
* MV partition storage
* TX state storage
* Raft node storages
Destruction might fail midway. On node recovery, we need to find all partition
replicas whose destruction was not finished and destroy them. The idea is that
all storages will have methods to find non-destroyed partitions; we'll use this
information to find the partitions we need to destroy (those that, according to
the assignments, should not be present on the node but still are).
* MvTableStorage will need the following method: Set<PartitionId>
partitionIdsOnDisk(). Take a look at tableIdsOnDisk() for reference.
** Pagemem-based storage will simply scan its FS directory
** RocksDB-based storage will scan the meta column family
* TxStateStorage will need the following method: Set<PartitionId>
partitionIdsOnDisk() working in the same way as RocksDB-based storage above
* For Raft-based storages, a scanning method added in IGNITE-25621 could be
used
On node recovery, we could do the following:
# expectedPartitionIds = findExpectedPartitionIds() (this finds IDs of all
partitions whose replicas are expected to exist on this node)
# For each on-disk partitionId in StorageEngine (across all StorageEngines),
if partitionId is not in expectedPartitionIds, its storage is destroyed (we
will need to add a method to destroy just one partition MV storage to
StorageEngine)
# For each on-disk partitionId in TxStateRocksDbSharedStorage, if partitionId
is not in expectedPartitionIds, its storage is destroyed (we will need to add a
method to destroy just one partition TX storage to TxStateRocksDbSharedStorage)
# For each on-disk partitionId in ReplicaManager, if partitionId is not in
expectedPartitionIds, its storage is destroyed
> Reliable destruction of evicted partition replicas
> --------------------------------------------------
>
> Key: IGNITE-25623
> URL: https://issues.apache.org/jira/browse/IGNITE-25623
> Project: Ignite
> Issue Type: Improvement
> Reporter: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
>
> When evicting a partition replica from a node, we destroy its storages:
> * MV partition storage
> * TX state storage
> * Raft node storages
> Destruction might fail midway. On node recovery, we need to find all
> partition replicas whose destruction was not finished and destroy them. The
> idea is that all storages will have methods to find non-destroyed partitions;
> we'll use this information to find the partitions we need to destroy (those
> that, according to the assignments, should not be present on the node but
> still are).
> * MvTableStorage will need the following method: Set<PartitionId>
> partitionIdsOnDisk(). Take a look at tableIdsOnDisk() for reference.
> ** Pagemem-based storage will simply scan its FS directory
> ** RocksDB-based storage will scan the meta column family
> * TxStateStorage will need the following method: Set<PartitionId>
> partitionIdsOnDisk(), which works the same way as the RocksDB-based storage
> above
> * For Raft-based storages, a scanning method added in IGNITE-25621 could be
> used
> More precisely, on node recovery, we should do the following:
> # expectedPartitionIds = findExpectedPartitionIds() (this finds IDs of all
> partitions whose replicas are expected to exist on this node)
> # For each on-disk partitionId in StorageEngine (across all StorageEngines),
> if partitionId is not in expectedPartitionIds, its storage is destroyed (we
> will need to add a method to destroy just one partition MV storage to
> StorageEngine)
> # For each on-disk partitionId in TxStateRocksDbSharedStorage, if
> partitionId is not in expectedPartitionIds, its storage is destroyed (we will
> need to add a method to destroy just one partition TX storage to
> TxStateRocksDbSharedStorage)
> # For each on-disk partitionId in ReplicaManager, if partitionId is not in
> expectedPartitionIds, its storage is destroyed
> findExpectedPartitionIds() should, conceptually, return the union of two
> sets: partitions for which the current node is in the partition's stable
> assignments and partitions for which the current node is in the partition's
> pending assignments. In reality, this is calculated in a slightly more
> involved way (taking disaster recovery into account).