[ 
https://issues.apache.org/jira/browse/CASSANDRA-18330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769190#comment-17769190
 ] 

Sam Tunnicliffe commented on CASSANDRA-18330:
---------------------------------------------

As people are starting to dig into the implementation in more detail, here are 
some pointers for getting to grips with the design and implementation.  

The [CEP 
document|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata]
 is still valid as a high-level design doc. The actual implementation is still 
essentially reflective of the 
[Implementation|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata#CEP21:TransactionalClusterMetadata-Implementation]
 section in this doc. The only thing which we expect to change substantially is 
the [CMS Reconfiguration 
Protocol|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata#CEP21:TransactionalClusterMetadata-CMSReconfigurationProtocol];
 the current implementation in 
[{{cep-21-tcm}}|https://github.com/apache/cassandra/tree/cep-21-tcm] is super 
simple and requires an operator to execute a nodetool command to add or remove 
each node from the CMS. In an upcoming patch, we plan to make that more 
declarative, so an operator can specify the desired topology of the CMS, 
similar to configuring a keyspace with {{NetworkTopologyStrategy}}.

We've tried to keep the external/operator interfaces to C* unchanged under TCM, 
so there isn't a lot to document there. However, there are a couple of minor 
changes to be aware of. We're working on documentation for new nodetool commands, 
metrics, virtual tables and runbooks for operators, but in the meantime the 
most relevant point is to do with upgrading.
Upgrading involves a separate step to enable TCM after all the nodes in the 
cluster have been brought up on the new version. This involves running 
{{nodetool addtocms}} on one node, which makes it the first member of the CMS 
and triggers all peers to start full TCM operation. Until this point, 
user traffic (reads/writes) functions as normal, but metadata-changing 
operations (token move, bootstrap, decommission, schema changes, etc) are 
rejected. 
If storage compatibility mode is enabled, then some aspects of TCM (namely the 
atomic visibility property) are not available. 
Standing up a brand new cluster doesn't require this step; the nodes themselves 
will identify & elect a first CMS member automatically.
In both cases, metadata-changing operations will be permitted as soon as the 
initial CMS migration is complete. At that point the CMS will only contain a 
single member, which is clearly not suitable for real clusters. To add more 
members to the CMS, just run the {{addtocms}} nodetool command on those nodes.
As mentioned above, this will change very soon.
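
As a rough illustration of the gating described above, here is a toy Java model. It is not Cassandra code: the class {{CmsGateSketch}} and all of its methods are hypothetical names invented for this example, and it only models the upgrade path (an explicit {{addtocms}} step), not the automatic election in a new cluster.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy model of the upgrade gate; hypothetical, not part of Cassandra. */
final class CmsGateSketch {
    // Nodes that are currently members of the CMS (empty until the first addtocms).
    private final Set<String> cmsMembers = new LinkedHashSet<>();

    /** Models an operator running `nodetool addtocms` on the given node. */
    void addToCms(String node) {
        cmsMembers.add(node);
    }

    /** Reads and writes are served both before and after the CMS is initialised. */
    boolean allowsUserTraffic() {
        return true;
    }

    /** Schema changes, bootstrap, decommission, token moves etc. need at least one CMS member. */
    boolean allowsMetadataChange() {
        return !cmsMembers.isEmpty();
    }

    int cmsSize() {
        return cmsMembers.size();
    }

    public static void main(String[] args) {
        CmsGateSketch cluster = new CmsGateSketch();
        System.out.println("metadata changes before addtocms: " + cluster.allowsMetadataChange());
        cluster.addToCms("node1");
        System.out.println("metadata changes after addtocms: " + cluster.allowsMetadataChange());
    }
}
```

In this model the first {{addToCms}} call flips the gate, and further calls only grow the CMS, which mirrors the current one-node-at-a-time procedure.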

On to the code itself: almost all new code is located in the 
{{org.apache.cassandra.tcm}} package. Logical starting points are 
{{ClusterMetadata}} and {{ClusterMetadataService}}.
* {{ClusterMetadata}} is the new data structure which contains all of the state 
which is versioned and globally managed. It includes schema, membership and 
ownership info (what was previously the responsibility of {{TokenMetadata}}) and 
the state of any ongoing cluster operations (like bootstraps/decommissions).
* Changes to {{ClusterMetadata}} are performed by submitting requests to the 
Cluster Metadata Service. The class {{ClusterMetadataService}} acts as a facade 
for this, so it acts locally on nodes which are members of the CMS and remotely 
for all other nodes. See the {{o.a.c.tcm.Processor}} interface and its 2 
(non-test) implementations.
* The [Event Submission 
Protocol|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata#CEP21:TransactionalClusterMetadata-EventSubmissionProtocol]
 section in the CEP describes how the CMS handles these changes, but 
essentially they are validated with respect to the current state of the 
{{ClusterMetadata}} and either accepted, which assigns a total order to the 
changes and commits them to being applied, or rejected.
* Committed modifications are replicated to all peers, which apply them to 
their own local {{ClusterMetadata}} in strict sequence. The class 
{{o.a.c.tcm.log.LocalLog}} is responsible for processing the replicated log 
entries, applying the {{Transformations}} they contain to the current 
{{ClusterMetadata}} then publishing the result and notifying any listeners. 
This is how we ensure consistent ordering, but it doesn't enforce atomic 
visibility.
* For that, see the section in the CEP titled [Read/Write Path 
Integration|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata#CEP21:TransactionalClusterMetadata-Read/WritePathIntegration].
 Essentially, message formats & verb handlers have been modified so that peers 
participating in read/write requests exchange information about their current 
metadata version. If any participant is lagging behind in a material way, they 
either catch up before proceeding or they signal to the coordinator that the 
operation's consistency cannot be guaranteed. As this requires the latest 
messaging version, running with storage compatibility mode on drops this 
guarantee.
* Gossip still exists in TCM, but it no longer plays a role in disseminating 
ownership/membership info or schema versions. It is still used for liveness (no 
real changes to {{FailureDetector}}) and "transient" state (such as load, rpc 
readiness, etc). For compatibility with tools that rely on gossip info, we 
have preserved all {{ApplicationStates}}, but where possible these are now 
populated from {{ClusterMetadata}}.
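
To make the commit-then-replicate flow in the list above more concrete, here is a minimal sketch in Java. It is a toy model under stated assumptions, not the actual implementation: {{Sequencer}}, {{Entry}} and this {{LocalLog}} are hypothetical stand-ins (the real classes live in {{org.apache.cassandra.tcm}} and have different APIs), cluster state is reduced to a list of node names, and the only kind of change is a node joining.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

/** Toy model only; none of these types are the real org.apache.cassandra.tcm classes. */
final class TcmLogSketch {

    /** A committed change. The epoch is its position in the total order. */
    static final class Entry {
        final long epoch;
        final UnaryOperator<List<String>> transformation;

        Entry(long epoch, UnaryOperator<List<String>> transformation) {
            this.epoch = epoch;
            this.transformation = transformation;
        }
    }

    /** Stand-in for the CMS: validates a change against current state, then accepts or rejects it. */
    static final class Sequencer {
        private long lastEpoch = 0;
        private final List<String> members = new ArrayList<>();

        /** Returns the committed entry, or null if the change is rejected. */
        Entry submitJoin(String node) {
            if (members.contains(node)) {
                return null; // rejected: invalid with respect to the current state
            }
            members.add(node);
            // Accepted: assign the next epoch, which fixes this change's place in the order.
            return new Entry(++lastEpoch, current -> {
                List<String> next = new ArrayList<>(current);
                next.add(node);
                return next;
            });
        }
    }

    /** Stand-in for a peer's log: applies entries strictly in epoch order, buffering any gaps. */
    static final class LocalLog {
        private final TreeMap<Long, Entry> pending = new TreeMap<>();
        private long appliedEpoch = 0;
        private List<String> metadata = List.of();

        void append(Entry entry) {
            if (entry.epoch > appliedEpoch) {
                pending.put(entry.epoch, entry); // already-applied epochs are ignored
            }
            // Apply only while the next consecutive epoch is available.
            while (!pending.isEmpty() && pending.firstKey() == appliedEpoch + 1) {
                Entry next = pending.pollFirstEntry().getValue();
                metadata = List.copyOf(next.transformation.apply(metadata));
                appliedEpoch = next.epoch;
            }
        }

        long appliedEpoch() {
            return appliedEpoch;
        }

        List<String> metadata() {
            return metadata;
        }
    }

    public static void main(String[] args) {
        Sequencer cms = new Sequencer();
        Entry first = cms.submitJoin("node1");
        Entry second = cms.submitJoin("node2");

        LocalLog peer = new LocalLog();
        peer.append(second); // arrives early, so it is buffered
        peer.append(first);  // fills the gap, so both are applied in order
        System.out.println(peer.appliedEpoch() + " " + peer.metadata());
    }
}
```

The sketch covers ordering only. As noted above, consistent ordering does not by itself give atomic visibility; that comes from peers exchanging metadata versions on the read/write path, which this example does not attempt to model.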




> Delivery of CEP-21: Transactional Cluster Metadata
> --------------------------------------------------
>
>                 Key: CASSANDRA-18330
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18330
>             Project: Cassandra
>          Issue Type: Epic
>          Components: Cluster/Membership, Cluster/Schema
>            Reporter: Sam Tunnicliffe
>            Assignee: Sam Tunnicliffe
>            Priority: Normal
>



