[ 
https://issues.apache.org/jira/browse/IGNITE-25621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Puchkovskiy updated IGNITE-25621:
---------------------------------------
    Description: 
On node recovery (in TableManager), we determine which tables were dropped with 
a drop moment that is already below the LWM (and which hence have to be 
destroyed), and we destroy those tables.

We currently determine such tables using the Catalog, which is a problem: the 
Catalog could have been compacted before the node went down, in which case it 
will not contain information about the tables to be destroyed on node 
recovery.

Currently, parts of a table are stored in the following storages:
 # MV table storage
 # TX state table storage (only when colocation is off)
 # Raft logs of table partitions (only when colocation is off)
 # Raft meta storages of table partitions (only when colocation is off)

The idea is that on startup we query/scan all those storages to find tables 
that were not destroyed (or whose destruction was not completed). Those of 
them that are not present in any Catalog version, or whose versions in the 
Catalog are all below the LWM, should be destroyed on node recovery.
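The destroy decision above could be sketched roughly as follows. All types and names here (CatalogVersion, mustDestroy) are illustrative stand-ins, not the real Catalog API, and the LWM is modeled as a catalog version number for simplicity:

```java
import java.util.List;
import java.util.Set;

/** Sketch of the recovery-time destroy decision; types are illustrative only. */
class DestroyDecision {
    /** One Catalog version, reduced to the set of table IDs it contains. */
    record CatalogVersion(int version, Set<Integer> tableIds) {}

    /**
     * A table found in a storage must be destroyed if it is absent from every
     * Catalog version, or if it only appears in versions below the LWM version.
     */
    static boolean mustDestroy(int tableId, List<CatalogVersion> versions, int lwmVersion) {
        // allMatch() is true for an empty stream, which covers the
        // "not present in any Catalog version" case as well.
        return versions.stream()
                .filter(v -> v.tableIds().contains(tableId))
                .allMatch(v -> v.version() < lwmVersion);
    }
}
```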
 * For the MV table storage, we need to add a method like Set<Integer> 
nonDestroyedTables() to StorageEngine.
 ** A PageMemory-based storage would simply scan its FS directory.
 ** For the RocksDB-based implementation, we will probably need to store the 
list of not-yet-destroyed table IDs explicitly in the meta column family, for 
efficiency. On the first start on a node with an old PDS, we will probably need 
a full scan of the storage to find all table IDs.
 * For the TX state table storage, we'll need an analogous method in 
TxStateRocksDbSharedStorage (but it will return IDs of tables or zones); 
implementation-wise, the situation is similar to the RocksDB-based 
implementation of the MV table storage.
 * For the Raft-based storages, a method is to be added to RaftServer that will 
scan the storages of partitions by groupId prefix.
 ** For Raft logs, the log storage family will have an analogous method, whose 
implementation could be similar to the RocksDB-based cases above.
 ** For Raft meta storages, we could scan directories.
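For the directory-scanning cases above, nonDestroyedTables() could be as simple as listing the engine's work directory. This is a sketch only: the "table-<id>" per-table directory layout is an assumption made for illustration, not the actual on-disk format:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

class FsTableScanner {
    /**
     * Returns IDs of tables whose data directories still exist, i.e. tables
     * that were not destroyed (or whose destruction did not complete).
     * Assumes an illustrative "table-<id>" per-table directory layout.
     */
    static Set<Integer> nonDestroyedTables(Path engineWorkDir) throws IOException {
        Set<Integer> ids = new HashSet<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(engineWorkDir, "table-*")) {
            for (Path dir : dirs) {
                if (Files.isDirectory(dir)) {
                    // Extract the numeric table ID from the directory name.
                    ids.add(Integer.parseInt(dir.getFileName().toString().substring("table-".length())));
                }
            }
        }
        return ids;
    }
}
```

The same shape would work for the Raft meta storages, with the glob pattern replaced by the groupId prefix mentioned above.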



> Reliable table destruction on node recovery
> -------------------------------------------
>
>                 Key: IGNITE-25621
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25621
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)