[
https://issues.apache.org/jira/browse/IGNITE-25621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Puchkovskiy updated IGNITE-25621:
---------------------------------------
Description:
On node recovery (in TableManager), we determine which tables were dropped with
their drop moment already below the LWM (and which hence have to be destroyed),
and we destroy those tables.
We currently determine such tables using the Catalog, which is a problem: the
Catalog could have been compacted before the node went down, in which case it
will not contain information about the tables to be destroyed on node
recovery.
Currently, the following storages hold parts of a table:
# MV table storage
# TX state table storage (only when colocation is off)
# Raft logs of table partitions (only when colocation is off)
# Raft meta storages of table partitions (only when colocation is off)
The idea is that on startup we query/scan all those storages to find tables
that were not destroyed (or whose destruction was not completed). Those of them
that are not present in any Catalog version, or whose Catalog versions are all
below the LWM, should be destroyed on node recovery.
* For the MV table storage, we need to add a method like Set<Integer>
nonDestroyedTables() to StorageEngine.
** The PageMemory-based storage would simply scan its FS directory.
** For the RocksDB-based implementation, we will probably need to store the
list of not-yet-destroyed table IDs explicitly in the meta column family, for
efficiency. On first start on a node with an old PDS, we will probably need a
full scan of the storage to find all table IDs.
* For the TX state table storage, we'll need an analogous method in
TxStateRocksDbSharedStorage (but it will return IDs of tables or zones);
implementation-wise, the situation is similar to the RocksDB-based
implementation of the MV table storage.
* For the Raft-based storages, a method is to be added to RaftServer that will
scan partition storages by groupId prefix.
** For Raft logs, the log storage family will have an analogous method, whose
implementation could be similar to the RocksDB-based cases above.
** For Raft meta storages, we could scan directories.
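The selection step described above (storage-reported tables minus tables still alive in the Catalog) could be sketched as follows. This is only a sketch under assumptions: the method and parameter names are hypothetical, and the real Catalog query API may look different.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntPredicate;

/** Sketch of the recovery-time decision; all names here are hypothetical. */
class OrphanTableDetector {
    /**
     * Returns IDs of tables whose data survives on disk but which must be destroyed:
     * those unknown to the Catalog, or dropped with all Catalog versions below the LWM.
     *
     * @param nonDestroyedTables table IDs reported by the storages
     *         (e.g. by a method like StorageEngine#nonDestroyedTables()).
     * @param aliveInCatalog predicate returning true if the table is present
     *         in some Catalog version at or above the LWM.
     */
    static Set<Integer> tablesToDestroy(Set<Integer> nonDestroyedTables, IntPredicate aliveInCatalog) {
        Set<Integer> toDestroy = new HashSet<>();

        for (int tableId : nonDestroyedTables) {
            if (!aliveInCatalog.test(tableId)) {
                toDestroy.add(tableId);
            }
        }

        return toDestroy;
    }
}
```

Keeping the Catalog check behind a predicate keeps the storage-scanning side independent of how exactly the Catalog answers "is this table still alive above the LWM".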
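For the PageMemory-based engine, the FS-directory scan could look roughly like this. The "table-<id>" directory naming is an assumption made for illustration, not the engine's actual layout:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of an FS scan; assumes (hypothetically) per-table dirs named "table-<id>". */
class FsTableScanner {
    /** Extracts table IDs from directory names of the form "table-<id>"; other names are ignored. */
    static Set<Integer> parseTableIds(Iterable<String> dirNames) {
        Set<Integer> ids = new HashSet<>();

        for (String name : dirNames) {
            if (name.startsWith("table-")) {
                ids.add(Integer.parseInt(name.substring("table-".length())));
            }
        }

        return ids;
    }

    /** Scans the engine work directory and returns IDs of tables that still have data on disk. */
    static Set<Integer> nonDestroyedTables(Path workDir) {
        List<String> names = new ArrayList<>();

        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(workDir)) {
            for (Path dir : dirs) {
                if (Files.isDirectory(dir)) {
                    names.add(dir.getFileName().toString());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        return parseTableIds(names);
    }
}
```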
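For the Raft-based storages, the groupId-prefix scan amounts to walking the per-partition group IDs and collecting the distinct table IDs they encode. A minimal sketch, assuming (hypothetically) that group IDs are strings of the form "<tableId>_part_<partitionId>" as a stand-in for whatever RaftServer actually keys its partition storages by:

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of extracting table IDs from partition group IDs; the format is assumed. */
class RaftGroupScanner {
    /**
     * Collects distinct table IDs from group IDs shaped like "<tableId>_part_<partitionId>".
     * Group IDs that don't match this shape are skipped.
     */
    static Set<Integer> tableIdsFromGroupIds(Iterable<String> groupIds) {
        Set<Integer> ids = new HashSet<>();

        for (String groupId : groupIds) {
            int sep = groupId.indexOf("_part_");

            if (sep > 0) {
                ids.add(Integer.parseInt(groupId.substring(0, sep)));
            }
        }

        return ids;
    }
}
```

With an ordered key space (as in a RocksDB-backed log storage), the same result can be obtained more cheaply by seeking to each table's prefix instead of iterating every group, which is presumably what "scan by prefix of groupId" refers to.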
> Reliable table destruction on node recovery
> -------------------------------------------
>
> Key: IGNITE-25621
> URL: https://issues.apache.org/jira/browse/IGNITE-25621
> Project: Ignite
> Issue Type: Improvement
> Reporter: Roman Puchkovskiy
> Assignee: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)