Hi Jordan, what would this look like from the implementation perspective? I was experimenting with transactional guardrails (1) where an operator controls the content of a virtual table backed by TCM, so whatever guardrail we change is automatically and transparently propagated to every node in the cluster. The POC worked quite nicely. TCM is just a vehicle to commit a change which then spreads around, and all these settings survive restarts. We would have the same configuration everywhere, which is not currently the case: guardrails are configured per node, and if their values are not persisted to yaml, they are forgotten on restart.
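
For illustration, here is a minimal, self-contained sketch of the mechanics I have in mind (all names are made up for this mail and are not the actual TCM APIs): guardrail changes are committed as entries in a replicated, durable log, every node applies them in order, and a restarted node simply replays the log to catch up:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of the idea above: guardrail changes are committed as
// entries in a replicated, durable log (a simple in-memory list stands in for
// the TCM log here). Every node applies entries in order and replays the log
// on restart, so the effective values converge cluster-wide instead of living
// only in each node's yaml.
public final class GuardrailLogSketch {

    // One committed change: "set guardrail X to value Y".
    record SetGuardrail(String name, String value) {}

    // Stand-in for the cluster-wide log that TCM would provide.
    static final List<SetGuardrail> LOG = new CopyOnWriteArrayList<>();

    // Per-node view; what a virtual table would expose to the operator.
    final Map<String, String> effective = new LinkedHashMap<>();

    // Operator path: commit a change once; it is appended to the shared log.
    static void commit(String name, String value) {
        LOG.add(new SetGuardrail(name, value));
    }

    // Every node applies new entries as they arrive...
    void apply(SetGuardrail change) {
        effective.put(change.name(), change.value());
    }

    // ...and on restart simply replays the whole log to catch up.
    void replayOnStartup() {
        effective.clear();
        LOG.forEach(this::apply);
    }

    public static void main(String[] args) {
        GuardrailLogSketch nodeA = new GuardrailLogSketch();
        GuardrailLogSketch nodeB = new GuardrailLogSketch();

        commit("tables_warn_threshold", "150");
        LOG.forEach(nodeA::apply);
        LOG.forEach(nodeB::apply);

        // A restarted (or newly joined) node replays the log and ends up
        // with the same values as everyone else.
        GuardrailLogSketch nodeC = new GuardrailLogSketch();
        nodeC.replayOnStartup();
        System.out.println(nodeA.effective + " " + nodeC.effective);
    }
}
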
Guardrails are just an example; the obvious next step is to expand this idea to the whole configuration in yaml. Of course, not all properties in yaml make sense to be the same cluster-wide (ip addresses etc.), but the ones which do would again be set the same way everywhere. The approach I described above makes sure that the configuration is the same everywhere, hence there can be no misunderstanding about which features this or that node has: if we say that all nodes have to have a particular feature, because we said so in the TCM log, then on restart / replay a node will "catch up" with whatever features it is asked to turn on. Your approach seems to be that we distribute which capabilities / features a cluster supports and that each individual node then configures itself (or not) to comply? Is there any intersection between these approaches? At first sight they seem related. How does one differ from the other from your point of view? (A small illustrative sketch of how I read the capabilities approach is below the quoted mail.)

Regards

(1) https://issues.apache.org/jira/browse/CASSANDRA-19593

On Thu, Dec 19, 2024 at 12:00 AM Jordan West <jw...@apache.org> wrote:

> In a recent discussion on the pains of upgrading, one topic that came up
> is a feature that Riak had called Capabilities [1]. A major pain with
> upgrades is that each node independently decides when to start using new
> or modified functionality. Even when we put this behind a config (like
> storage compatibility mode), each node immediately enables the feature
> when the config is changed and the node is restarted. This causes various
> types of upgrade pain such as failed streams and schema disagreement. A
> recent example of this is CASSANDRA-20118 [2]. In some cases operators can
> prevent this from happening through careful coordination (e.g. ensuring
> upgrade sstables only runs after the whole cluster is upgraded), but this
> typically requires custom code in whatever control plane the operator is
> using. A capabilities framework would distribute the state of what
> features each node has (and their status, e.g. enabled or not) so that the
> cluster can choose to opt in to new features once the whole cluster has
> them available. From experience, having this in Riak made upgrades a
> significantly less risky process and also paved a path towards repeatable
> downgrades. I think Cassandra would benefit from it as well.
>
> Further, other tools like analytics could benefit from having this
> information since currently it's up to the operator to manually determine
> the state of the cluster in some cases.
>
> I am considering drafting a CEP proposal for this feature but wanted to
> take the general temperature of the community and get some early thoughts
> while working on the draft.
>
> Looking forward to hearing y'alls thoughts,
> Jordan
>
> [1]
> https://github.com/basho/riak_core/blob/25d9a6fa917eb8a2e95795d64eb88d7ad384ed88/src/riak_core_capability.erl#L23-L72
>
> [2] https://issues.apache.org/jira/browse/CASSANDRA-20118
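
To make my question above more concrete, here is a tiny, self-contained sketch of how I read the capabilities idea (all names are made up for this mail, nothing here is a proposed API): each node advertises what it supports, and the cluster-wide agreed set is the intersection across all nodes, so nothing new is enabled until the last node can do it:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of capability negotiation as I understand it: each node
// advertises which features it supports, and the "negotiated" cluster-wide set
// is the intersection across all nodes, so a feature is only opted into once
// every node has it available.
public final class CapabilityNegotiationSketch {

    // What one node would gossip/register about itself.
    record NodeCapabilities(String node, Set<String> supported) {}

    // Intersect everyone's advertised capabilities; anything missing on a
    // single node (e.g. one not yet upgraded) stays disabled cluster-wide.
    static Set<String> negotiate(List<NodeCapabilities> nodes) {
        Set<String> agreed = null;
        for (NodeCapabilities n : nodes) {
            if (agreed == null) agreed = new HashSet<>(n.supported());
            else agreed.retainAll(n.supported());
        }
        return agreed == null ? Set.of() : agreed;
    }

    public static void main(String[] args) {
        List<NodeCapabilities> cluster = List.of(
            new NodeCapabilities("10.0.0.1", Set.of("sstable_n", "sstable_o")),
            new NodeCapabilities("10.0.0.2", Set.of("sstable_n", "sstable_o")),
            // Node still on the old version: only supports the old format.
            new NodeCapabilities("10.0.0.3", Set.of("sstable_n")));

        // Prints [sstable_n]: the new format is not enabled until the last
        // node is upgraded and advertises it too.
        System.out.println(negotiate(cluster));
    }
}

If that reading is right, the main difference I see is direction: with TCM we would push one agreed configuration top-down to every node, while capabilities derive the agreed set bottom-up from what the nodes themselves advertise.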