[
https://issues.apache.org/jira/browse/CASSANDRA-18330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769190#comment-17769190
]
Sam Tunnicliffe commented on CASSANDRA-18330:
---------------------------------------------
As people are starting to dig into the implementation in more detail, here are
some pointers for getting to grips with the design and implementation.
The [CEP
document|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata]
is still valid as a high-level design doc. The actual implementation is still
essentially reflective of the
[Implementation|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata#CEP21:TransactionalClusterMetadata-Implementation]
section in this doc. The only thing which we expect to change substantially is
the [CMS Reconfiguration
Protocol|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata#CEP21:TransactionalClusterMetadata-CMSReconfigurationProtocol];
the current implementation in
[{{cep-21-tcm}}|https://github.com/apache/cassandra/tree/cep-21-tcm] is super
simple and requires an operator to execute a nodetool command to add or remove
each node from the CMS. In an upcoming patch, we plan to make that more
declarative, so an operator can specify the desired topology of the CMS,
similar to configuring a keyspace with {{NetworkTopologyStrategy}}.
We've tried to keep the external/operator interfaces to C* unchanged under TCM,
so there isn't a lot to document there. However, there are a couple of minor
changes to be aware of. We're working on documentation for new nodetool commands,
metrics, virtual tables and runbooks for operators, but in the meantime the
most relevant point is to do with upgrading.
Upgrading involves a separate step to enable TCM after all the nodes in the
cluster have been brought up on the new version. This involves running
{{nodetool addtocms}} on one node, which makes it the first member of the CMS
and triggers all peers to start full TCM operation. Until this point,
user traffic (reads/writes) functions as normal, but metadata-changing
operations (token move, bootstrap, decommission, schema changes, etc) are
rejected.
If storage compatibility mode is enabled, then some aspects of TCM (namely the
atomic visibility property) are not available.
Standing up a brand new cluster doesn't require this step; the nodes themselves
will identify & elect a first CMS member automatically.
In both cases, metadata-changing operations will be permitted as soon as the
initial CMS migration is complete. At that point the CMS will only contain a
single member, which is clearly not suitable for real clusters. To add more
members to the CMS, just run the {{addtocms}} nodetool command on those nodes.
As mentioned above, this will change very soon.
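To make the gating described above concrete, here is a minimal Java sketch of the rule that user traffic is always served while metadata-changing operations are rejected until the CMS has its first member. The class {{CmsGateSketch}}, its {{Operation}} enum and its method names are all hypothetical and exist only for illustration; none of them are taken from the {{cep-21-tcm}} branch.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative model only; these names are assumptions, not classes from the branch. */
final class CmsGateSketch {
    enum Operation { READ, WRITE, SCHEMA_CHANGE, BOOTSTRAP, DECOMMISSION, TOKEN_MOVE }

    // Nodes currently in the CMS; empty until the initial migration has happened.
    private final Set<String> cmsMembers = new LinkedHashSet<>();

    /** Models the effect of running the addtocms nodetool command on a node. */
    void addToCms(String node) {
        cmsMembers.add(node);
    }

    /** User traffic is always allowed; metadata changes need an initialized CMS. */
    boolean isPermitted(Operation op) {
        boolean userTraffic = op == Operation.READ || op == Operation.WRITE;
        return userTraffic || !cmsMembers.isEmpty();
    }

    int cmsSize() {
        return cmsMembers.size();
    }
}
```

In this model, a freshly upgraded cluster answers {{isPermitted(READ)}} with true and {{isPermitted(SCHEMA_CHANGE)}} with false until the first {{addToCms}} call, after which both are true and {{cmsSize()}} is 1.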
Onto the code itself, almost all new code is located in the
{{org.apache.cassandra.tcm}} package. Logical starting points are
{{ClusterMetadata}} and {{ClusterMetadataService}}.
* {{ClusterMetadata}} is the new data structure which contains all of the state
which is versioned and globally managed. It includes schema, membership and
ownership info (what was previously the responsibility of {{TokenMetadata}}) and
the state of any ongoing cluster operations (like bootstraps/decommissions).
* Changes to {{ClusterMetadata}} are performed by submitting requests to the
Cluster Metadata Service. The class {{ClusterMetadataService}} acts as a facade
for this, so it acts locally on nodes which are members of the CMS and remotely
for all other nodes. See the {{o.a.c.tcm.Processor}} interface and its 2
(non-test) implementations.
* The [Event Submission
Protocol|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata#CEP21:TransactionalClusterMetadata-EventSubmissionProtocol]
section in the CEP describes how the CMS handles these changes, but
essentially they are validated against the current state of the
{{ClusterMetadata}} and either accepted, which assigns a total order to the
changes and commits them to being applied, or rejected.
* Committed modifications are replicated to all peers, which apply them to
their own local {{ClusterMetadata}} in strict sequence. The class
{{o.a.c.tcm.log.LocalLog}} is responsible for processing the replicated log
entries, applying the {{Transformations}} they contain to the current
{{ClusterMetadata}} then publishing the result and notifying any listeners.
This is how we ensure consistent ordering, but it doesn't enforce atomic
visibility.
* For that, see the section in the CEP titled [Read/Write Path
Integration|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata#CEP21:TransactionalClusterMetadata-Read/WritePathIntegration].
Essentially, message formats & verb handlers have been modified so that peers
participating in read/write requests exchange information about their current
metadata version. If any participant is lagging behind in a material way, they
either catch up before proceeding or they signal to the coordinator that the
operation's consistency cannot be guaranteed. As this requires the latest
messaging version, running with storage compatibility mode on drops this
guarantee.
* Gossip still exists in TCM, but it no longer plays a role in disseminating
ownership/membership info or schema versions. It is still used for liveness (no
real changes to {{FailureDetector}}) and "transient" state (such as load, rpc
readiness, etc). For compatibility with tools etc which rely on gossip info we
have preserved all {{ApplicationStates}}, but where possible these are now
populated from {{ClusterMetadata}}.
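The "apply in strict sequence" behaviour described for the local log can be sketched roughly as follows. This is a simplified model, not the real {{o.a.c.tcm.log.LocalLog}}: the class {{LocalLogSketch}}, the numeric epoch starting at 0, the member list used as state and the {{join}} helper are all assumptions made for illustration. Entries may arrive in any order, but each is applied only once every earlier epoch has been applied.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for a local log; NOT the real o.a.c.tcm.log.LocalLog. */
final class LocalLogSketch {
    /** A committed entry: the position assigned by the CMS plus the change to apply. */
    static final class Entry {
        final long epoch;
        final UnaryOperator<List<String>> transformation;

        Entry(long epoch, UnaryOperator<List<String>> transformation) {
            this.epoch = epoch;
            this.transformation = transformation;
        }
    }

    private final TreeMap<Long, Entry> pending = new TreeMap<>();
    private final List<Long> notified = new ArrayList<>();
    private long currentEpoch = 0;                           // assumed starting point
    private List<String> members = Collections.emptyList();  // illustrative state only

    /** Accepts entries in any arrival order; applies them only in epoch order. */
    synchronized void append(Entry entry) {
        if (entry.epoch > currentEpoch)
            pending.put(entry.epoch, entry); // already-applied epochs are dropped
        Entry next;
        // Drain every buffered entry that is now consecutive with the current epoch.
        while ((next = pending.remove(currentEpoch + 1)) != null) {
            List<String> updated = new ArrayList<>(next.transformation.apply(members));
            members = Collections.unmodifiableList(updated); // publish the new snapshot
            currentEpoch = next.epoch;
            notified.add(next.epoch);                        // stand-in for notifying listeners
        }
    }

    synchronized long currentEpoch() { return currentEpoch; }
    synchronized List<String> members() { return members; }
    synchronized List<Long> notified() { return new ArrayList<>(notified); }

    /** Hypothetical transformation: a node joining the cluster at the given epoch. */
    static Entry join(long epoch, String node) {
        return new Entry(epoch, current -> {
            List<String> copy = new ArrayList<>(current);
            copy.add(node);
            return copy;
        });
    }
}
```

If the entry for epoch 2 arrives before the entry for epoch 1, the sketch buffers it and leaves the published state untouched. Once epoch 1 arrives, both are applied back to back, so every peer passes through the same sequence of states. As the comment notes, ordering alone does not give atomic visibility, and the sketch makes no attempt to model the version exchange on the read/write path.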
> Delivery of CEP-21: Transactional Cluster Metadata
> --------------------------------------------------
>
> Key: CASSANDRA-18330
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18330
> Project: Cassandra
> Issue Type: Epic
> Components: Cluster/Membership, Cluster/Schema
> Reporter: Sam Tunnicliffe
> Assignee: Sam Tunnicliffe
> Priority: Normal
>