This is an automated email from the ASF dual-hosted git repository.

samt pushed a commit to branch cep-21-tcm
in repository https://gitbox.apache.org/repos/asf/cassandra.git


The following commit(s) were added to refs/heads/cep-21-tcm by this push:
     new 357ca7c2e9 Add implementation overview doc
357ca7c2e9 is described below

commit 357ca7c2e9e0515dd8f68821a81cbe820b4bf026
Author: Alex Petrov <[email protected]>
AuthorDate: Wed Oct 18 17:36:57 2023 +0100

    Add implementation overview doc
---
 .../org/apache/cassandra/tcm/TCM_implementation.md | 57 ++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/src/java/org/apache/cassandra/tcm/TCM_implementation.md b/src/java/org/apache/cassandra/tcm/TCM_implementation.md
new file mode 100644
index 0000000000..58cf01d5e3
--- /dev/null
+++ b/src/java/org/apache/cassandra/tcm/TCM_implementation.md
@@ -0,0 +1,57 @@
+# TCM Implementation Details
+
+This document walks you through the core classes involved in Transactional Cluster Metadata (TCM). It describes the process of bringing a new node up into an existing TCM cluster. Each section heading lists the key classes used or described in that section.
+
+## Startup: Discovery, Vote
+
+The boot process in TCM is very similar to the previously existing one, but it is now split across several classes rather than living mostly in `StorageService`. First, `ClusterMetadataService` is initialized via `Startup#initialize`. The node determines its startup mode, which in the usual case is `Vote`: the node initializes itself as a non-CMS node and attempts to discover an existing CMS service or, failing that, participates in a vote to establish a new one.
+
+The node then continues startup and eventually reaches `StorageService#initServer`, where, among other things, it gossips with the CMS nodes to obtain a fresh view of the cluster for failure detection (FD) purposes, and then waits for Gossip to settle (again, for FD purposes).
+
+## Registration: ClusterMetadata, Transformation, RemoteProcessor
+
+Before joining the ring, the node has to register in order to obtain a `NodeId`, which happens in `Register#maybeRegister`. Registration is done by committing a `Register` transformation using the `ClusterMetadataService#commit` method. `Register` and other transformations are side-effect-free functions mapping an instance of immutable `ClusterMetadata` to the next `ClusterMetadata`. `ClusterMetadata` holds all information about the cluster: the directory of registered nodes, the schema, node states, and data placements.
+
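+To make the shape of a transformation concrete, here is a minimal, self-contained sketch of the idea, using invented `ToyMetadata`/`ToyTransformation`/`ToyRegister` types rather than the real Cassandra classes: a side-effect-free function maps one immutable metadata snapshot to the next.
+
+```java
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Map;
+
+// Hypothetical, heavily simplified stand-ins for ClusterMetadata and a Register-like
+// transformation; the real classes carry much more state (schema, placements, ...).
+final class ToyMetadata {
+    final long epoch;                       // monotonically growing version counter
+    final Map<String, String> directory;    // nodeId -> address, exposed as an immutable view
+
+    ToyMetadata(long epoch, Map<String, String> directory) {
+        this.epoch = epoch;
+        this.directory = Collections.unmodifiableMap(new HashMap<>(directory));
+    }
+}
+
+interface ToyTransformation {
+    // Side-effect free: takes the previous metadata, returns the next one.
+    ToyMetadata execute(ToyMetadata prev);
+}
+
+final class ToyRegister implements ToyTransformation {
+    private final String nodeId;
+    private final String address;
+
+    ToyRegister(String nodeId, String address) {
+        this.nodeId = nodeId;
+        this.address = address;
+    }
+
+    @Override
+    public ToyMetadata execute(ToyMetadata prev) {
+        Map<String, String> next = new HashMap<>(prev.directory);
+        next.put(nodeId, address);
+        // A new immutable snapshot with the next epoch; the input is left untouched.
+        return new ToyMetadata(prev.epoch + 1, next);
+    }
+}
+```
+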
+Since the node executing the registration is not a CMS node, it uses a `RemoteProcessor` to perform this commit. `RemoteProcessor` is a simple RPC tool that serializes the transformation and attempts to execute it by contacting the CMS nodes and sending them a `TCM_COMMIT_REQ`.
+
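+The remote-commit idea can be sketched roughly as follows, with invented names and a pluggable `rpc` function standing in for Cassandra's real messaging: the transformation is serialized once and the commit request is tried against the known CMS endpoints in turn.
+
+```java
+import java.util.List;
+import java.util.Optional;
+import java.util.function.BiFunction;
+
+// Hypothetical remote processor: serialize the transformation and try each CMS node
+// until one of them accepts the commit request.
+final class ToyRemoteProcessor {
+    private final List<String> cmsEndpoints;
+    // (endpoint, payload) -> optional response; stands in for real RPC plumbing.
+    private final BiFunction<String, byte[], Optional<byte[]>> rpc;
+
+    ToyRemoteProcessor(List<String> cmsEndpoints, BiFunction<String, byte[], Optional<byte[]>> rpc) {
+        this.cmsEndpoints = cmsEndpoints;
+        this.rpc = rpc;
+    }
+
+    Optional<byte[]> commit(byte[] serializedTransformation) {
+        for (String endpoint : cmsEndpoints) {
+            Optional<byte[]> response = rpc.apply(endpoint, serializedTransformation);
+            if (response.isPresent())
+                return response;  // a CMS node accepted and answered the commit request
+        }
+        return Optional.empty();  // no CMS node could be reached; the caller decides on retries
+    }
+}
+```
+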
+## Commit Request: PaxosBackedProcessor, DistributedMetadataLogKeyspace, Retry
+
+When a CMS node receives a commit request, it deserializes the transformation and attempts to execute it using `PaxosBackedProcessor`. The Paxos-backed processor stores the entire cluster metadata log in the `cluster_metadata.distributed_metadata_log` table. It performs a simple CAS LWT that attempts to append a new entry to the log with an `Epoch` that is strictly consecutive to the last one. `Epoch` is a monotonically increasing counter of `ClusterMetadata` versions.
+
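+The "append only if the epoch is strictly consecutive" rule can be illustrated with an in-memory toy (invented names; the real implementation enforces this with a CAS/LWT against the distributed log table):
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+
+// Hypothetical in-memory model of the append rule: an entry is accepted only if the
+// caller correctly observed the current last epoch, i.e. the new epoch is strictly
+// consecutive. Otherwise the caller must re-read the log and retry.
+final class ToyMetadataLog {
+    static final class Entry {
+        final long epoch;
+        final byte[] transformation;
+        Entry(long epoch, byte[] transformation) { this.epoch = epoch; this.transformation = transformation; }
+    }
+
+    private final List<Entry> entries = new ArrayList<>();
+
+    // Returns true iff the entry was appended at epoch expectedPreviousEpoch + 1.
+    synchronized boolean tryAppend(long expectedPreviousEpoch, byte[] transformation) {
+        long lastEpoch = entries.isEmpty() ? 0 : entries.get(entries.size() - 1).epoch;
+        if (lastEpoch != expectedPreviousEpoch)
+            return false;                      // someone else appended first
+        entries.add(new Entry(lastEpoch + 1, transformation));
+        return true;
+    }
+
+    synchronized long lastEpoch() {
+        return entries.isEmpty() ? 0 : entries.get(entries.size() - 1).epoch;
+    }
+}
+```
+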
+Both the remote and the Paxos-backed processors use the `Retry` class to manage retries. The remote processor sets a deadline for its retries using `tcm_await_timeout`. The CMS-local processor permits itself at most `tcm_rpc_timeout` for its retry attempts.
+
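+A minimal sketch of a deadline-bounded retry loop, with invented names (`ToyRetry`, `deadlineMillis`) rather than the real `Retry` class:
+
+```java
+import java.util.Optional;
+import java.util.function.Supplier;
+
+// Hypothetical deadline-bounded retry loop; the real Retry class is richer (backoff
+// policies, counters). This only shows the "retry until the deadline elapses" shape.
+final class ToyRetry {
+    static <T> Optional<T> withDeadline(long deadlineMillis, long pauseMillis, Supplier<Optional<T>> attempt)
+            throws InterruptedException {
+        long deadline = System.currentTimeMillis() + deadlineMillis;
+        while (System.currentTimeMillis() < deadline) {
+            Optional<T> result = attempt.get();
+            if (result.isPresent())
+                return result;           // success, stop retrying
+            Thread.sleep(pauseMillis);   // simple fixed pause between attempts
+        }
+        return Optional.empty();         // deadline exhausted
+    }
+}
+```
+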
+`PaxosBackedProcessor` then attempts to execute the `Transformation`. The result of the execution is either `Success` or `Reject`. `Reject`s are not persisted in the log; they are linearized using a read that confirms the transformation was executed against the highest epoch. Examples of `Reject`s include validation errors, exceptions encountered while attempting to execute the transformation, etc. For example, `Register` would return a rejection if a node with the same IP address already exists in the directory.
+
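+A minimal sketch of the success-or-reject outcome using invented types; a rejection carries a reason and never reaches the log, while a success carries the enacted epoch:
+
+```java
+// Hypothetical result hierarchy for transformation execution: either the commit
+// enacted a new epoch, or it was rejected with a reason (and nothing was appended).
+abstract class ToyCommitResult {
+    static final class Success extends ToyCommitResult {
+        final long enactedEpoch;
+        Success(long enactedEpoch) { this.enactedEpoch = enactedEpoch; }
+    }
+
+    static final class Reject extends ToyCommitResult {
+        final String reason;   // e.g. "a node with this IP address is already registered"
+        Reject(String reason) { this.reason = reason; }
+    }
+}
+```
+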
+## Commit Response: Entry, LocalLog
+
+After `PaxosBackedProcessor` succeeds in committing the entry to the distributed log, it broadcasts the commit result, which contains an `Entry` holding the newly appended transformation, to the rest of the cluster using `Replicator` (which simply iterates over all nodes in the directory, informing them about the new epoch). This operation does not need to be reliable and has no retries. In other words, if a node was down during the CMS's attempt to replicate entries to it, it will inevitably learn about the new epoch later.
+
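+The best-effort broadcast can be sketched as follows (invented names, no real messaging): every peer in the directory is notified once, failures are ignored, and there are no retries.
+
+```java
+import java.util.Collection;
+import java.util.function.BiConsumer;
+
+// Hypothetical fire-and-forget replicator: tell every known peer about the new epoch
+// exactly once; a peer that misses the notification simply catches up later.
+final class ToyReplicator {
+    private final BiConsumer<String, Long> notifyPeer;  // (peerAddress, epoch) -> send one message
+
+    ToyReplicator(BiConsumer<String, Long> notifyPeer) {
+        this.notifyPeer = notifyPeer;
+    }
+
+    void broadcast(Collection<String> directory, long newEpoch) {
+        for (String peer : directory) {
+            try {
+                notifyPeer.accept(peer, newEpoch);
+            } catch (RuntimeException ignored) {
+                // Best effort: a down peer learns about the epoch through other means.
+            }
+        }
+    }
+}
+```
+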
+Along with the committed `Entry`, the response from the CMS to the peer that submitted it also contains all the entries needed for the node that initiated the commit to fully catch up to the epoch enacted by the committed transformation.
+
+When `RemoteProcessor` receives a response from a CMS node, it appends all received entries to the `LocalLog`. `LocalLog` processes the backlog of pending entries and enacts a new epoch by constructing a new `ClusterMetadata`.
+
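+A sketch of the local-log idea with invented names: entries may arrive out of order, so they are buffered and enacted only once they form a contiguous run of epochs on top of the currently enacted metadata.
+
+```java
+import java.util.TreeMap;
+import java.util.function.UnaryOperator;
+
+// Hypothetical local log: buffers entries keyed by epoch and enacts them strictly in
+// epoch order. "Metadata" here is an opaque string to keep the sketch self-contained;
+// the real log produces new ClusterMetadata instances.
+final class ToyLocalLog {
+    private final TreeMap<Long, UnaryOperator<String>> pending = new TreeMap<>();
+    private long enactedEpoch;
+    private String metadata;
+
+    ToyLocalLog(long initialEpoch, String initialMetadata) {
+        this.enactedEpoch = initialEpoch;
+        this.metadata = initialMetadata;
+    }
+
+    synchronized void append(long epoch, UnaryOperator<String> transformation) {
+        if (epoch > enactedEpoch)
+            pending.put(epoch, transformation);   // entries at or below the enacted epoch are stale
+        processPending();
+    }
+
+    // Enact entries only while they form a contiguous run; a gap stops processing
+    // until the missing epochs arrive (e.g. via catch-up).
+    private void processPending() {
+        while (!pending.isEmpty() && pending.firstKey() == enactedEpoch + 1) {
+            metadata = pending.pollFirstEntry().getValue().apply(metadata);
+            enactedEpoch++;
+        }
+    }
+
+    synchronized long enactedEpoch() { return enactedEpoch; }
+}
+```
+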
+## Bootstrap: InProgressSequence, PrepareJoin, BootstrapAndJoin
+
+At that point, the node is ready to start the process of joining the ring. This begins in `Startup#startup`. `Startup#getInitialTransformation` determines that the node should start a regular bootstrap process (as opposed to a replacement), and the node proceeds to commit a `PrepareJoin` transformation. During `PrepareJoin`, `ClusterMetadata` is changed in the following ways (a code sketch follows this list):
+
+* Ranges that will be affected by the bootstrap of the node are locked (see `LockedRanges`).
+    * If the computed locked ranges intersect with ranges that were locked before this transformation was executed, `PrepareJoin` is rejected.
+* An `InProgressSequence`, holding the three transformations (`StartJoin`, `MidJoin`, `FinishJoin`), is computed and added to the `InProgressSequences` map.
+    * If any in-progress sequences associated with the current node are already present, `PrepareJoin` is rejected.
+* `AffectedRanges`, the ranges whose placements are going to change while executing this sequence, are computed and returned as part of the commit success message.
+
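+The checks above can be sketched as a pure function over a toy metadata snapshot (all names invented): it either rejects, or returns a new snapshot with the ranges locked and the join sequence recorded.
+
+```java
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+
+// Hypothetical PrepareJoin-style step: reject on conflicting locks or an existing
+// sequence for this node, otherwise lock the ranges and record the join sequence.
+final class ToyPrepareJoin {
+    static final class Snapshot {
+        final Set<String> lockedRanges;        // range ids locked by in-flight operations
+        final Map<String, String> inProgress;  // nodeId -> name of its in-progress sequence
+        Snapshot(Set<String> lockedRanges, Map<String, String> inProgress) {
+            this.lockedRanges = lockedRanges;
+            this.inProgress = inProgress;
+        }
+    }
+
+    // Returns the next snapshot, or null to signal a rejection (a real implementation
+    // would return a Success/Reject result object instead).
+    static Snapshot execute(Snapshot prev, String nodeId, Set<String> affectedRanges) {
+        for (String range : affectedRanges)
+            if (prev.lockedRanges.contains(range))
+                return null;                    // intersects ranges locked by another operation
+        if (prev.inProgress.containsKey(nodeId))
+            return null;                        // this node already has a sequence in flight
+        Set<String> locked = new HashSet<>(prev.lockedRanges);
+        locked.addAll(affectedRanges);
+        Map<String, String> sequences = new HashMap<>(prev.inProgress);
+        sequences.put(nodeId, "BootstrapAndJoin");  // placeholder for the join sequence
+        return new Snapshot(locked, sequences);
+    }
+}
+```
+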
+The `InProgressSequence` is then executed step by step. All local operations that the node has to perform between executing these steps are implemented as part of the in-progress sequence (see `BootstrapAndJoin#executeNext`). We make *no assumptions* about the liveness of the node between the execution of in-progress sequence steps. For example, the node may crash after executing `PrepareJoin` but before it updates tokens in the local keyspace. So the only assumption we make is that `SystemKeyspace` [...]
+
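+A sketch of resumable step-by-step execution, with invented names: only the index of the next step needs to be durable, so execution can resume from that step after a crash.
+
+```java
+import java.util.List;
+import java.util.concurrent.atomic.AtomicInteger;
+
+// Hypothetical in-progress sequence executor: steps run one at a time, and only the
+// index of the next step must survive a restart for the sequence to be resumable
+// (the real code persists this kind of progress durably).
+final class ToySequenceExecutor {
+    interface Step { void run(); }
+
+    private final List<Step> steps;
+    private final AtomicInteger nextStep;   // stands in for durably persisted progress
+
+    ToySequenceExecutor(List<Step> steps, int persistedNextStep) {
+        this.steps = steps;
+        this.nextStep = new AtomicInteger(persistedNextStep);
+    }
+
+    // Executes the next step if any remain; returns true while more work is left.
+    boolean executeNext() {
+        int i = nextStep.get();
+        if (i >= steps.size())
+            return false;
+        steps.get(i).run();          // may include local work such as streaming or token updates
+        nextStep.incrementAndGet();  // in reality this progress marker is persisted before moving on
+        return i + 1 < steps.size();
+    }
+}
+```
+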
+In order to ensure quorum consistency, before executing each next step the node has to await the `ProgressBarrier`. CEP-21 contains a detailed explanation of why progress barriers are necessary. For the purposes of this document, it suffices to say that a majority of the owners of the `AffectedRanges` have to learn about the epoch enacting the previous step before the next step can be executed. This is done in order to preserve the replication factor for eventually consistent queries.
+
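+The barrier condition can be sketched as a simple majority check over invented types: before the next step runs, a majority of the owners of each affected range must have acknowledged the epoch that enacted the previous step.
+
+```java
+import java.util.Collection;
+import java.util.Map;
+import java.util.Set;
+
+// Hypothetical progress barrier check: for every affected range, a majority of its
+// owners must have acknowledged (learned about) the given epoch.
+final class ToyProgressBarrier {
+    // affectedRanges: rangeId -> owning replicas; ackedEpochs: replica -> highest acked epoch
+    static boolean satisfied(Map<String, Set<String>> affectedRanges,
+                             Map<String, Long> ackedEpochs,
+                             long requiredEpoch) {
+        for (Map.Entry<String, Set<String>> range : affectedRanges.entrySet()) {
+            Collection<String> owners = range.getValue();
+            long acks = owners.stream()
+                              .filter(owner -> ackedEpochs.getOrDefault(owner, 0L) >= requiredEpoch)
+                              .count();
+            if (acks < owners.size() / 2 + 1)
+                return false;   // keep waiting: this range does not yet have a majority
+        }
+        return true;            // every affected range has a majority at or above the epoch
+    }
+}
+```
+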
+Upon executing all steps in the in-progress sequence, the ranges are unlocked and the sequence itself is removed from `ClusterMetadata`.
+
+
+## Querying: CoordinatorBehindException, FetchPeerLog, FetchCMSLog
+
+As the node starts participating in reads and writes, its view of the ring or schema may diverge from that of other nodes. TCM makes a best effort to minimize the window during which this can happen, but in a distributed system at least some delay is inevitable. TCM solves this problem by including the highest `Epoch` known to the node in every request that the node coordinates, and in every response to the coordinator when the node serves as a replica.
+
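+A sketch of this epoch piggy-backing with invented message types: both the request a coordinator sends and the response a replica returns carry the sender's highest known epoch, so either side can notice that it is behind and trigger a catch-up.
+
+```java
+// Hypothetical message envelopes showing the epoch carried on traffic in both
+// directions; on receipt, each side compares the peer's epoch with its own.
+final class ToyMessages {
+    static final class Request {
+        final long coordinatorEpoch;   // highest epoch known to the coordinator
+        final byte[] payload;
+        Request(long coordinatorEpoch, byte[] payload) {
+            this.coordinatorEpoch = coordinatorEpoch;
+            this.payload = payload;
+        }
+    }
+
+    static final class Response {
+        final long replicaEpoch;       // highest epoch known to the replica
+        final byte[] payload;
+        Response(long replicaEpoch, byte[] payload) {
+            this.replicaEpoch = replicaEpoch;
+            this.payload = payload;
+        }
+    }
+}
+```
+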
+Replicas can check the schema and ring consistency of the *current* request by comparing the `Epoch` the coordinator has with the epoch at which the schema was last modified and the epoch at which the placements for the given range were last modified. If the replica knows that the coordinator couldn't have known about either the schema or the ring, it throws a `CoordinatorBehindException`. In all other cases (i.e. when either the coordinator or the replica is aware of a higher `Epoch`, but existe [...]
+
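+A sketch of the replica-side check, with invented names: if the coordinator's epoch predates the epoch at which the schema or the relevant placements last changed, the replica refuses the request.
+
+```java
+// Hypothetical replica-side validation: the request is only safe to serve if the
+// coordinator's epoch is at least as new as the last schema change and the last
+// placement change for the queried range.
+final class ToyEpochValidator {
+    static final class CoordinatorBehind extends RuntimeException {
+        CoordinatorBehind(String message) { super(message); }
+    }
+
+    static void checkCoordinator(long coordinatorEpoch, long lastSchemaChangeEpoch, long lastPlacementChangeEpoch) {
+        if (coordinatorEpoch < lastSchemaChangeEpoch)
+            throw new CoordinatorBehind("coordinator cannot have seen the current schema");
+        if (coordinatorEpoch < lastPlacementChangeEpoch)
+            throw new CoordinatorBehind("coordinator cannot have seen the current placements for this range");
+        // Otherwise the request proceeds; merely knowing about a higher epoch is fine
+        // as long as neither the schema nor the relevant placements changed since.
+    }
+}
+```
+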
+After the coordinator has collected enough responses, it compares its `Epoch` with the `Epoch` that was used to construct the `ReplicaPlan` for the query it is coordinating. If the epochs differ, it checks whether the collected replica responses still satisfy the consistency level the query was executed at.
+
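+A sketch of the coordinator-side recheck, again with invented names: if the metadata moved on while responses were being collected, the coordinator re-validates that the responders still satisfy the consistency level against the current owners of the range.
+
+```java
+import java.util.Set;
+
+// Hypothetical coordinator-side recheck: when the enacted epoch has moved past the
+// epoch the ReplicaPlan was built at, confirm the responses still meet the
+// consistency level against the up-to-date owners of the range.
+final class ToyConsistencyRecheck {
+    static boolean stillSatisfied(long planEpoch,
+                                  long currentEpoch,
+                                  Set<String> currentOwners,
+                                  Set<String> responders,
+                                  int requiredResponses) {
+        if (planEpoch == currentEpoch)
+            return true;   // nothing changed while the query was in flight
+        long validResponses = responders.stream()
+                                        .filter(currentOwners::contains)
+                                        .count();
+        return validResponses >= requiredResponses;
+    }
+}
+```
+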


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
