This is an automated email from the ASF dual-hosted git repository.

ppa pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/ignite-3.git


The following commit(s) were added to refs/heads/main by this push:
     new 2529abea06 IGNITE-22712 Describe catalog compaction in README.md (#4325)
2529abea06 is described below

commit 2529abea067da5ce7f75093d3781abf8ffc3173a
Author: Pavel Pereslegin <[email protected]>
AuthorDate: Wed Sep 11 16:29:09 2024 +0400

    IGNITE-22712 Describe catalog compaction in README.md (#4325)
---
 modules/catalog-compaction/README.md               | 102 +++++++++++++++++++++
 .../catalog-compaction/tech-notes/compaction.png   | Bin 0 -> 64952 bytes
 .../catalog-compaction/tech-notes/compaction.puml  |  41 +++++++++
 .../tech-notes/replicas-update.png                 | Bin 0 -> 43735 bytes
 .../tech-notes/replicas-update.puml                |  35 +++++++
 5 files changed, 178 insertions(+)

diff --git a/modules/catalog-compaction/README.md b/modules/catalog-compaction/README.md
index 2b1b2693f2..a2bf3038a8 100644
--- a/modules/catalog-compaction/README.md
+++ b/modules/catalog-compaction/README.md
@@ -1 +1,103 @@
 # Catalog compaction module
+
+> Compaction was moved out of the catalog module into a separate module to eliminate circular
+> dependencies, as it requires components that may themselves depend on the catalog
+> module. Please refer to the catalog module's [readme](../catalog/README.md) for more
+> information about the catalog service and the update log.
+
+## Overview
+
+During schema changes, the catalog update log stores incremental updates. Each update
+increases the catalog version. Over time, the log may grow to a huge size. To
+address this, snapshotting was introduced to the UpdateLog. Snapshotting means replacing
+incremental updates with a snapshot.
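
The snapshotting idea can be sketched in a few lines. This is a minimal illustration only: the map-of-strings model, class name, and `snapshot` method are hypothetical, and the real `UpdateLog` API is different.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Minimal sketch of snapshotting: incremental updates up to (and including)
// snapshotVersion are replaced by a single snapshot entry.
// All names here are illustrative; the real UpdateLog API differs.
class UpdateLogSketch {
    final NavigableMap<Integer, String> entries = new TreeMap<>();

    void snapshot(int snapshotVersion) {
        // Drop the incremental updates that the snapshot supersedes...
        entries.headMap(snapshotVersion, true).clear();
        // ...and put a single snapshot entry in their place.
        entries.put(snapshotVersion, "snapshot@" + snapshotVersion);
    }
}
```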
+
+However, different components can refer to a specific version of the catalog. Until they
+finish their work with that version, it cannot be truncated.
+
+This module introduces the
+[CatalogCompactionRunner](src/main/java/org/apache/ignite/internal/catalog/compaction/CatalogCompactionRunner.java)
+component. It handles periodic catalog compaction, ensuring that the catalog versions
+being dropped are no longer needed by any component in the cluster.
+
+## Compaction restrictions
+
+1. Catalog compaction can only be performed up to the highest version whose activation time
+   is lower than or equal to the begin timestamp of the earliest active transaction.
+2. The catalog must not be compacted past a version that may be necessary to replay the raft log
+   during recovery.
+3. Index building uses a specific catalog version. This version cannot be truncated until
+   the index build is complete.
+4. Rebalance uses a specific catalog version. This version cannot be truncated until the rebalance
+   is complete.
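
Restriction 1 above boils down to a bounded scan over the version history. The following is a hedged sketch only: the version-to-activation-time map and method name are hypothetical, and the actual inclusive/exclusive handling lives in `CatalogCompactionRunner`.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of restriction 1: find the highest catalog version whose activation
// time is lower than or equal to the minimum begin timestamp among active
// transactions. Names and data model are assumptions made for illustration.
class CompactionBound {
    static int highestVersionAtOrBefore(NavigableMap<Integer, Long> activationTimes, long minActiveTxBeginTime) {
        int result = -1; // -1: no version qualifies, nothing can be compacted

        for (var e : activationTimes.entrySet()) {
            if (e.getValue() <= minActiveTxBeginTime) {
                result = e.getKey();
            } else {
                break; // activation time grows with the version number
            }
        }

        return result;
    }
}
```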
+
+## Coordinator
+
+For simplicity, compaction is performed from a single node (the compaction coordinator), which is
+also the metastorage group leader. Therefore, when the metastorage group leader
+changes, the compaction coordinator also changes.
+
+The [ElectionListener](../metastorage/src/main/java/org/apache/ignite/internal/metastorage/impl/ElectionListener.java)
+interface was introduced to listen for metastorage leader elections.
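
A minimal sketch of the coordinator handover idea follows. The interface shape and method signature here are assumptions for illustration and may differ from the real `ElectionListener`; it only models the rule that the new metastorage leader takes over the coordinator role.

```java
// Hypothetical listener shape; the real ElectionListener interface may differ.
interface ElectionListener {
    void onLeaderElected(String newLeaderName);
}

// The node that becomes metastorage leader also becomes the compaction coordinator.
class CompactionCoordinatorSwitch implements ElectionListener {
    private final String localNodeName;
    private volatile boolean coordinator;

    CompactionCoordinatorSwitch(String localNodeName) {
        this.localNodeName = localNodeName;
    }

    @Override
    public void onLeaderElected(String newLeaderName) {
        // Take (or give up) the coordinator role on every leader change.
        coordinator = localNodeName.equals(newLeaderName);
    }

    boolean isCoordinator() {
        return coordinator;
    }
}
```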
+
+## Triggering factors
+
+The process is initiated by one of the following events:
+
+1. The `low watermark` changes (every 5 minutes by default).
+2. The compaction coordinator changes.
+
+## Overall process description
+
+Catalog compaction consists of two main stages:
+
+1. **Replicas update**. Propagates to all replication groups the minimum begin
+   time among all active read-write transactions in the cluster. After some
+   time (see below for details), these timestamps are published and become
+   available for the next stage.
+
+2. **Compaction**. Using the timestamps published during the previous stage, the coordinator
+   calculates the minimum required version of the catalog and performs compaction.
+
+Publishing the timestamps can take a long time, and the success of compaction depends on more
+than just these timestamps. That is why both stages run in parallel: the compaction
+stage uses the result of the replicas update calculated in one of the previous iterations.
+To minimize the number of network requests, both processes run simultaneously and use a common
+[request](src/main/java/org/apache/ignite/internal/catalog/compaction/message/CatalogCompactionMinimumTimesRequest.java)
+to collect timestamps from the entire cluster in one round trip.
+
+### Replicas update stage
+
+![Replicas update](tech-notes/replicas-update.png)
+
+This stage consists of the following steps:
+
+1. Each node uses
+   [ActiveLocalTxMinimumBeginTimeProvider](../transactions/src/main/java/org/apache/ignite/internal/tx/ActiveLocalTxMinimumBeginTimeProvider.java)
+   to determine the minimum begin time among all local active read-write
+   transactions and sends it to the coordinator.
+2. The coordinator calculates the global minimum and sends it to all nodes using
+   [CatalogCompactionPrepareUpdateTxBeginTimeMessage](src/main/java/org/apache/ignite/internal/catalog/compaction/message/CatalogCompactionPrepareUpdateTxBeginTimeMessage.java).
+3. Each node stores this time within the replication groups for which the local node is the leader,
+   using [UpdateMinimumActiveTxBeginTimeReplicaRequest](../partition-replicator/src/main/java/org/apache/ignite/internal/partition/replicator/network/replication/UpdateMinimumActiveTxBeginTimeReplicaRequest.java)
+   and [UpdateMinimumActiveTxBeginTimeCommand](../partition-replicator/src/main/java/org/apache/ignite/internal/partition/replicator/network/command/UpdateMinimumActiveTxBeginTimeCommand.java).
+4. This timestamp (let's call it `minTxTime`) is published (becomes available to the compaction
+   process) only after a checkpoint happens (partition data is flushed to disk).
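
Step 2 above reduces the per-node values to a single global minimum. A sketch under stated assumptions: `HybridTimestamp` is modeled as a plain `long`, and `Long.MAX_VALUE` stands in for "no active RW transactions"; both are simplifications, not the actual protocol types.

```java
import java.util.List;

// Sketch of step 2: the coordinator reduces the per-node minimum transaction
// begin times to a single global minimum. HybridTimestamp is modeled as a
// long for illustration only.
class GlobalMinTxTime {
    static long compute(List<Long> perNodeMinimums) {
        // Long.MAX_VALUE models a node (or cluster) with no active RW transactions.
        return perNodeMinimums.stream().mapToLong(Long::longValue).min().orElse(Long.MAX_VALUE);
    }
}
```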
+
+### Compaction stage
+
+![Compaction](tech-notes/compaction.png)
+
+1. Each node determines its local minimum required time. This consists of the following steps:
+   1. Use the introduced `RaftGroupStateProvider` to determine the minimum time among all published
+      timestamps (`minTxTime`) in the local replication groups.
+   2. If some `minTxTime` is not published yet, the current iteration of compaction is aborted.
+   3. Select the minimum between the determined `minTxTime` and the current `low watermark`.
+2. Each node sends the calculated local minimum timestamp to the coordinator,
+   as well as the set of local replication groups that was used to calculate the local minimum.
+3. The coordinator determines the global minimum required time.
+4. Using this time, the coordinator determines the catalog version up to which (inclusive)
+   the history can be trimmed.
+5. Based on the calculated catalog version, the coordinator determines the list of partitions
+   required by it and compares the actual replication group distribution with what was received
+   from the remote nodes. The current iteration is aborted in any of the following cases:
+   1. The logical topology is missing a node required by the catalog.
+   2. A node is missing a required replication group.
+   3. The calculated catalog version has an index that is still being built.
+   4. There is an active rebalance that refers to the calculated catalog version (or a lower one).
+6. The coordinator performs catalog compaction.
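
Step 1 of this stage can be sketched as follows. This is a sketch under assumptions: the class and method names are hypothetical, timestamps are modeled as `long`, and the `NOT_PUBLISHED` marker is an invented stand-in for a group that has not yet published `minTxTime`.

```java
import java.util.Collection;
import java.util.List;
import java.util.OptionalLong;

// Sketch of step 1: a node derives its local minimum required time from the
// published minTxTime values of its replication groups and the current low
// watermark. All names here are assumptions made for illustration.
class LocalMinimumRequiredTime {
    static final long NOT_PUBLISHED = -1L;

    // An empty result means some group has not published minTxTime yet,
    // which aborts the current compaction iteration.
    static OptionalLong compute(Collection<Long> publishedMinTxTimes, long lowWatermark) {
        long minTxTime = Long.MAX_VALUE;

        for (long t : publishedMinTxTimes) {
            if (t == NOT_PUBLISHED) {
                return OptionalLong.empty();
            }
            minTxTime = Math.min(minTxTime, t);
        }

        return OptionalLong.of(Math.min(minTxTime, lowWatermark));
    }
}
```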
diff --git a/modules/catalog-compaction/tech-notes/compaction.png b/modules/catalog-compaction/tech-notes/compaction.png
new file mode 100644
index 0000000000..38358dafab
Binary files /dev/null and b/modules/catalog-compaction/tech-notes/compaction.png differ
diff --git a/modules/catalog-compaction/tech-notes/compaction.puml b/modules/catalog-compaction/tech-notes/compaction.puml
new file mode 100644
index 0000000000..b3b4e05087
--- /dev/null
+++ b/modules/catalog-compaction/tech-notes/compaction.puml
@@ -0,0 +1,41 @@
+@startuml
+title Compaction stage
+
+database Coordinator as crd
+database "Cluster node" as node
+
+activate crd
+
+crd -> node ++ : MinimumTimesRequest
+node -> node
+note right
+  obtain previously published minimum
+  tx begin time using RaftGroupStateProvider
+  from all local replication groups
+end note
+node -> node : compute minimum among all replication groups
+node -> node : select minimum required time between low watermark and min tx timestamp
+node -> crd -- : MinimumTimesResponse
+note right
+  response includes
+  1. local minimum required time
+  2. local replication groups
+end note
+crd -> crd : compute global minimum required time
+crd -> crd : determine the version of the catalog up to which (inclusive) history can be trimmed
+alt <font color="#880000">compaction aborted</font>
+  crd -[#660000]x crd : <font color="#440000">verify restrictions
+  note right #ffdddd
+    All required nodes present in logical topology
+    Timestamp was published on all replicas
+    Distribution of replicas matches expected
+    There is no active rebalance related to the compacted version (or below)
+    Calculated catalog version has no index that is still building
+  end note
+  crd -[#660000]> crd : <font color="#440000"> abort compaction
+else <font color="#006600">compaction successful</font>
+  crd -[#green]> crd : <font color="#004400"> verify restrictions
+  crd -[#green]> crd : <font color="#004400"> save catalog snapshot (trim history)
+end
+
+@enduml
\ No newline at end of file
diff --git a/modules/catalog-compaction/tech-notes/replicas-update.png b/modules/catalog-compaction/tech-notes/replicas-update.png
new file mode 100644
index 0000000000..0a0d894979
Binary files /dev/null and b/modules/catalog-compaction/tech-notes/replicas-update.png differ
diff --git a/modules/catalog-compaction/tech-notes/replicas-update.puml b/modules/catalog-compaction/tech-notes/replicas-update.puml
new file mode 100644
index 0000000000..26d2fb0632
--- /dev/null
+++ b/modules/catalog-compaction/tech-notes/replicas-update.puml
@@ -0,0 +1,35 @@
+@startuml
+title Replicas update stage
+
+database Coordinator as crd
+database "Cluster node" as node
+queue replicas as replicas
+
+activate crd
+
+crd -> node ++ : MinimumTimesRequest
+node -> node
+note right
+  determine minimum begin time among all
+  active RW transactions started locally using
+  ActiveLocalTxMinimumBeginTimeProvider
+end note
+node -> crd -- : MinimumTimesResponse
+
+crd -> crd : compute global minimum tx time
+
+crd -> node ++ : PrepareUpdateTxBeginTimeMessage
+node ->> node
+note right
+  send local
+  UpdateMinimumActiveTxBeginTimeReplicaRequest
+  to local primary replicas
+end note
+node ->> replicas : UpdateMinimumActiveTxBeginTimeCommand
+replicas ->> replicas
+note right
+  publish timestamp when flushing
+  of the partition data is completed
+end note
+
+@enduml
\ No newline at end of file
