This is an automated email from the ASF dual-hosted git repository.
ppa pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/ignite-3.git
The following commit(s) were added to refs/heads/main by this push:
new 2529abea06 IGNITE-22712 Describe catalog compaction in README.md (#4325)
2529abea06 is described below
commit 2529abea067da5ce7f75093d3781abf8ffc3173a
Author: Pavel Pereslegin <[email protected]>
AuthorDate: Wed Sep 11 16:29:09 2024 +0400
IGNITE-22712 Describe catalog compaction in README.md (#4325)
---
modules/catalog-compaction/README.md | 102 +++++++++++++++++++++
.../catalog-compaction/tech-notes/compaction.png | Bin 0 -> 64952 bytes
.../catalog-compaction/tech-notes/compaction.puml | 41 +++++++++
.../tech-notes/replicas-update.png | Bin 0 -> 43735 bytes
.../tech-notes/replicas-update.puml | 35 +++++++
5 files changed, 178 insertions(+)
diff --git a/modules/catalog-compaction/README.md b/modules/catalog-compaction/README.md
index 2b1b2693f2..a2bf3038a8 100644
--- a/modules/catalog-compaction/README.md
+++ b/modules/catalog-compaction/README.md
@@ -1 +1,103 @@
# Catalog compaction module
+
+> Compaction was moved out of the catalog module into a separate module to
+eliminate circular dependencies, as it requires components that may themselves
+depend on the catalog module. Please refer to the catalog module's
+[readme](../catalog/README.md) for more information about the catalog service
+and the update log.
+
+## Overview
+
+During schema changes, the catalog update log stores incremental updates, and
+each update increases the catalog version. Over time, the log may grow to a
+huge size. To address this, snapshotting was introduced to the UpdateLog:
+incremental updates are replaced with a snapshot.
+
+However, different components may refer to a specific version of the catalog.
+Until they finish their work with that version, it cannot be truncated.
+
+This module introduces the
+[CatalogCompactionRunner](src/main/java/org/apache/ignite/internal/catalog/compaction/CatalogCompactionRunner.java)
+component, which handles periodic catalog compaction and ensures that the
+dropped catalog versions are no longer needed by any component in the cluster.
+
+## Compaction restrictions
+
+1. Catalog compaction can be performed only up to the highest version whose
+   activation time is lower than or equal to the earliest begin timestamp among
+   active transactions.
+2. The catalog must not be compacted for versions that may be needed to replay
+   the Raft log during recovery.
+3. Index building uses a specific catalog version. This version cannot be
+   truncated until the index build is complete.
+4. Rebalance uses a specific catalog version. This version cannot be truncated
+   until the rebalance is complete.
+
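Restriction 1 can be sketched as follows. This is a minimal illustration with made-up names (`lastRequiredVersion` is not the actual Ignite API), assuming plain `long` timestamps: a transaction that began at timestamp `T` observes the highest catalog version whose activation time is lower than or equal to `T`, so that version must be kept and only versions strictly below it may be dropped.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch, not the actual Ignite code: which catalog version
// is still required by the earliest active RW transaction.
public class CompactableVersionSketch {
    /**
     * @param activationTimeByVersion catalog version -> activation timestamp
     *        (assumed non-empty, versions in ascending order)
     * @param minTxBeginTime minimum begin time among active RW transactions
     * @return the highest version whose activation time is lower than or equal
     *         to {@code minTxBeginTime}; versions strictly below it can be
     *         compacted
     */
    public static int lastRequiredVersion(NavigableMap<Integer, Long> activationTimeByVersion,
            long minTxBeginTime) {
        int candidate = activationTimeByVersion.firstKey();
        for (var e : activationTimeByVersion.entrySet()) {
            if (e.getValue() <= minTxBeginTime) {
                candidate = e.getKey(); // still observable by that transaction
            }
        }
        return candidate;
    }

    public static void main(String[] args) {
        NavigableMap<Integer, Long> log = new TreeMap<>();
        log.put(1, 100L);
        log.put(2, 200L);
        log.put(3, 300L);
        // A transaction that began at t=250 still reads version 2,
        // so only version 1 can be compacted.
        System.out.println(lastRequiredVersion(log, 250L)); // prints 2
    }
}
```
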
+## Coordinator
+
+For simplicity, compaction is performed from a single node (the compaction
+coordinator), which is also the metastorage group leader. Therefore, when the
+metastorage group leader changes, the compaction coordinator changes as well.
+
+The [ElectionListener](../metastorage/src/main/java/org/apache/ignite/internal/metastorage/impl/ElectionListener.java)
+interface was introduced to listen for metastorage leader elections.
+
+## Triggering factors
+
+The process is initiated by one of the following events:
+
+1. The `low watermark` has changed (every 5 minutes by default).
+2. The compaction coordinator has changed.
+
+## Overall process description
+
+Catalog compaction consists of two main stages:
+
+1. **Replicas update**. Propagates the minimum begin time among all active
+   read-write transactions in the cluster to all replication groups. After some
+   time (see below for details) these timestamps are published and become
+   available to the next stage.
+
+2. **Compaction**. Using the timestamps published at the previous stage, the
+   coordinator calculates the minimum required version of the catalog and
+   performs compaction.
+
+Publishing the timestamps can take a long time, and the success of compaction
+depends on more than just these timestamps. For this reason, both stages run in
+parallel: the compaction stage uses the replicas-update result calculated at
+one of the previous iterations. To minimize the number of network requests,
+both stages share a common
+[request](src/main/java/org/apache/ignite/internal/catalog/compaction/message/CatalogCompactionMinimumTimesRequest.java)
+that collects the timestamps from the entire cluster in a single round trip.
+
+### Replicas update stage
+
+![Replicas update stage](tech-notes/replicas-update.png)
+
+This stage consists of the following steps:
+
+1. Each node uses
+   [ActiveLocalTxMinimumBeginTimeProvider](../transactions/src/main/java/org/apache/ignite/internal/tx/ActiveLocalTxMinimumBeginTimeProvider.java)
+   to determine the minimum begin time among all local active read-write
+   transactions and sends it to the coordinator.
+2. The coordinator calculates the global minimum and sends it to all nodes
+   using
+   [CatalogCompactionPrepareUpdateTxBeginTimeMessage](src/main/java/org/apache/ignite/internal/catalog/compaction/message/CatalogCompactionPrepareUpdateTxBeginTimeMessage.java).
+3. Each node stores this time within the replication groups for which it is
+   the leader (using
+   [UpdateMinimumActiveTxBeginTimeReplicaRequest](../partition-replicator/src/main/java/org/apache/ignite/internal/partition/replicator/network/replication/UpdateMinimumActiveTxBeginTimeReplicaRequest.java)
+   and
+   [UpdateMinimumActiveTxBeginTimeCommand](../partition-replicator/src/main/java/org/apache/ignite/internal/partition/replicator/network/command/UpdateMinimumActiveTxBeginTimeCommand.java)).
+4. This timestamp (let's call it `minTxTime`) is published (becomes available
+   to the compaction process) only after a checkpoint happens (partition data
+   is flushed to disk).
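
Steps 3 and 4 can be sketched as follows. This is an illustrative model only (names and plain `long` timestamps are made up, not the actual Ignite types): the propagated minimum is stored in the replication group, but it becomes visible to the compaction stage only after a checkpoint flushes partition data to disk.

```java
import java.util.OptionalLong;

// Hypothetical sketch, not the actual Ignite code: a replication group
// stores the propagated minimum tx begin time, but publishes it to the
// compaction stage only once a checkpoint has happened.
public class MinTxTimePublisher {
    private long pending = -1;   // stored via UpdateMinimumActiveTxBeginTimeCommand
    private long published = -1; // visible to the compaction stage

    public void store(long minTxTime) {
        pending = Math.max(pending, minTxTime); // the timestamp only moves forward
    }

    public void onCheckpoint() {
        published = pending; // partition data flushed; timestamp becomes visible
    }

    public OptionalLong published() {
        return published < 0 ? OptionalLong.empty() : OptionalLong.of(published);
    }

    public static void main(String[] args) {
        MinTxTimePublisher group = new MinTxTimePublisher();
        group.store(250L);
        System.out.println(group.published()); // empty: no checkpoint yet
        group.onCheckpoint();
        System.out.println(group.published()); // 250 is now visible
    }
}
```
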
+
+### Compaction stage
+
+![Compaction stage](tech-notes/compaction.png)
+
+1. Each node determines the local minimum required time; this consists of the
+   following steps:
+   1. Using the introduced `RaftGroupStateProvider`, determine the minimum
+      among all published timestamps (`minTxTime`) in the local replication
+      groups.
+   2. If some `minTxTime` is not published yet, the current iteration of
+      compaction is aborted.
+   3. Select the minimum between the determined `minTxTime` and the current
+      `low watermark`.
+2. Each node sends the calculated local minimum timestamp to the coordinator,
+   along with the set of local replication groups that was used to calculate
+   it.
+3. The coordinator determines the global minimum required time.
+4. Using this time, the coordinator determines the version of the catalog up
+   to which (inclusive) the history can be trimmed.
+5. Based on the calculated catalog version, the coordinator determines the
+   list of partitions required by it and compares the actual distribution of
+   the replication groups with what was received from the remote nodes. The
+   current iteration is aborted in the following cases:
+   1. the logical topology is missing a node required by the catalog;
+   2. some node is missing a required replication group;
+   3. the calculated catalog version has an index that is still being built;
+   4. there is an active rebalance that refers to the calculated (or an
+      earlier) version of the catalog.
+6. The coordinator performs catalog compaction.
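
Step 1 above can be sketched as follows. This is a minimal illustration under assumed names and plain `long` timestamps (`localMinimum` is not the actual Ignite API): the node-local result is the minimum of the published per-group `minTxTime` values and the `low watermark`, and the iteration is aborted if any group has not published a timestamp yet.

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical sketch, not the actual Ignite code: computing the node-local
// minimum required time for the compaction stage.
public class LocalMinimumRequiredTimeSketch {
    /**
     * @param publishedMinTxTimes published `minTxTime` per local replication
     *        group; an empty optional means the group has not published yet
     * @param lowWatermark current low watermark timestamp
     * @return the local minimum required time, or empty to abort this iteration
     */
    public static OptionalLong localMinimum(List<OptionalLong> publishedMinTxTimes,
            long lowWatermark) {
        long min = lowWatermark;
        for (OptionalLong t : publishedMinTxTimes) {
            if (t.isEmpty()) {
                return OptionalLong.empty(); // abort: some group has not published
            }
            min = Math.min(min, t.getAsLong());
        }
        return OptionalLong.of(min);
    }

    public static void main(String[] args) {
        // Two groups published 300 and 220; low watermark is 260 -> local minimum is 220.
        System.out.println(localMinimum(List.of(OptionalLong.of(300), OptionalLong.of(220)), 260L));
        // One group has not published yet -> the iteration is aborted.
        System.out.println(localMinimum(List.of(OptionalLong.of(300), OptionalLong.empty()), 260L));
    }
}
```
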
diff --git a/modules/catalog-compaction/tech-notes/compaction.png b/modules/catalog-compaction/tech-notes/compaction.png
new file mode 100644
index 0000000000..38358dafab
Binary files /dev/null and b/modules/catalog-compaction/tech-notes/compaction.png differ
diff --git a/modules/catalog-compaction/tech-notes/compaction.puml b/modules/catalog-compaction/tech-notes/compaction.puml
new file mode 100644
index 0000000000..b3b4e05087
--- /dev/null
+++ b/modules/catalog-compaction/tech-notes/compaction.puml
@@ -0,0 +1,41 @@
+@startuml
+title Compaction stage
+
+database Coordinator as crd
+database "Cluster node" as node
+
+activate crd
+
+crd -> node ++ : MinimumTimesRequest
+node -> node
+note right
+ obtain previously published minimum
+ tx begin time using RaftGroupStateProvider
+ from all local replication groups
+end note
+node -> node : compute minimum among all replication groups
node -> node : select minimum required time between low watermark and min tx timestamp
+node -> crd -- : MinimumTimesResponse
+note right
+ response includes:
+ 1. local minimum required time
+ 2. local replication groups
+end note
+crd -> crd : compute global minimum required time
+crd -> crd : determine the version of the catalog up to which (inclusive) history can be trimmed
+alt <font color="#880000">compaction aborted</font>
+ crd -[#660000]x crd : <font color="#440000">verify restrictions
+ note right #ffdddd
+ All required nodes present in logical topology
+ Timestamp was published on all replicas
+ Distribution of replicas matches expected
+ There is no active rebalance related to the compacted version (or below)
+ Calculated catalog version has no index that is still building
+ end note
+ crd -[#660000]> crd : <font color="#440000"> abort compaction
+else <font color="#006600">compaction successful</font>
+ crd -[#green]> crd : <font color="#004400"> verify restrictions
+ crd -[#green]> crd : <font color="#004400"> save catalog snapshot (trim history)
+end
+
+@enduml
\ No newline at end of file
diff --git a/modules/catalog-compaction/tech-notes/replicas-update.png b/modules/catalog-compaction/tech-notes/replicas-update.png
new file mode 100644
index 0000000000..0a0d894979
Binary files /dev/null and b/modules/catalog-compaction/tech-notes/replicas-update.png differ
diff --git a/modules/catalog-compaction/tech-notes/replicas-update.puml b/modules/catalog-compaction/tech-notes/replicas-update.puml
new file mode 100644
index 0000000000..26d2fb0632
--- /dev/null
+++ b/modules/catalog-compaction/tech-notes/replicas-update.puml
@@ -0,0 +1,35 @@
+@startuml
+title Replicas update stage
+
+database Coordinator as crd
+database "Cluster node" as node
+queue replicas as replicas
+
+activate crd
+
+crd -> node ++ : MinimumTimesRequest
+node -> node
+note right
+ determine minimum begin time among all
+ active RW transactions started locally using
+ ActiveLocalTxMinimumBeginTimeProvider
+end note
+node -> crd -- : MinimumTimesResponse
+
+crd -> crd : compute global minimum tx time
+
+crd -> node ++ : PrepareUpdateTxBeginTimeMessage
+node ->> node
+note right
+ send local UpdateMinimumActiveTxBeginTimeReplicaRequest
+ to local primary replicas
+end note
+node ->> replicas : UpdateMinimumActiveTxBeginTimeCommand
+replicas ->> replicas
+note right
+ publish timestamp when flushing
+ of the partition data is completed
+end note
+
+@enduml
\ No newline at end of file