Hi Jordan, what would this look like from the implementation perspective? I was experimenting with transactional guardrails (1) where an operator controls the content of a virtual table backed by TCM, so whatever guardrail we change is automatically and transparently propagated to every node in the cluster. The POC worked quite nicely. TCM is just a vehicle to commit a change which then spreads around, and all these settings survive restarts. We would have the same configuration everywhere, which is not currently the case: guardrails are configured per node, and if their values are not persisted to yaml, they are forgotten on restart.
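
For illustration, here is a minimal, self-contained sketch of the mechanics I have in mind (all names are made up for this mail and are not the actual TCM APIs): guardrail changes are committed as entries in a replicated, durable log, every node applies them in order, and a restarted node simply replays the log to catch up:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of the idea above: guardrail changes are committed as
// entries in a replicated, durable log (a simple in-memory list stands in for
// the TCM log here). Every node applies entries in order and replays the log
// on restart, so the effective values converge cluster-wide instead of living
// only in each node's yaml.
public final class GuardrailLogSketch {

    // One committed change: "set guardrail X to value Y".
    record SetGuardrail(String name, String value) {}

    // Stand-in for the cluster-wide log that TCM would provide.
    static final List<SetGuardrail> LOG = new CopyOnWriteArrayList<>();

    // Per-node view; what a virtual table would expose to the operator.
    final Map<String, String> effective = new LinkedHashMap<>();

    // Operator path: commit a change once; it is appended to the shared log.
    static void commit(String name, String value) {
        LOG.add(new SetGuardrail(name, value));
    }

    // Every node applies new entries as they arrive...
    void apply(SetGuardrail change) {
        effective.put(change.name(), change.value());
    }

    // ...and on restart simply replays the whole log to catch up.
    void replayOnStartup() {
        effective.clear();
        LOG.forEach(this::apply);
    }

    public static void main(String[] args) {
        GuardrailLogSketch nodeA = new GuardrailLogSketch();
        GuardrailLogSketch nodeB = new GuardrailLogSketch();

        commit("tables_warn_threshold", "150");
        LOG.forEach(nodeA::apply);
        LOG.forEach(nodeB::apply);

        // A restarted (or newly joined) node replays the log and ends up
        // with the same values as everyone else.
        GuardrailLogSketch nodeC = new GuardrailLogSketch();
        nodeC.replayOnStartup();
        System.out.println(nodeA.effective + " " + nodeC.effective);
    }
}
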
Guardrails are just an example; the obvious next step is to expand this idea to the whole configuration in yaml. Of course, not all properties in yaml make sense to be the same cluster-wide (ip addresses etc.), but the ones which do would again be set the same way everywhere. The approach I described above makes sure that the configuration is the same everywhere, hence there can be no misunderstanding about which features this or that node has: if we say that all nodes have to have a particular feature, because we said so in the TCM log, then on restart / replay a node will "catch up" with whatever features it is asked to turn on. Your approach seems to be that we distribute which capabilities / features a cluster supports and that each individual node then configures itself (or not) to comply? Is there any intersection between these approaches? At first sight they seem related. How does one differ from the other from your point of view? (A small illustrative sketch of how I read the capabilities approach is below the quoted mail.)

Regards

(1) https://issues.apache.org/jira/browse/CASSANDRA-19593

On Thu, Dec 19, 2024 at 12:00 AM Jordan West <jw...@apache.org> wrote:

> In a recent discussion on the pains of upgrading, one topic that came up
> is a feature that Riak had called Capabilities [1]. A major pain with
> upgrades is that each node independently decides when to start using new
> or modified functionality. Even when we put this behind a config (like
> storage compatibility mode), each node immediately enables the feature
> when the config is changed and the node is restarted. This causes various
> types of upgrade pain such as failed streams and schema disagreement. A
> recent example of this is CASSANDRA-20118 [2]. In some cases operators can
> prevent this from happening through careful coordination (e.g. ensuring
> upgrade sstables only runs after the whole cluster is upgraded), but this
> typically requires custom code in whatever control plane the operator is
> using. A capabilities framework would distribute the state of what
> features each node has (and their status, e.g. enabled or not) so that the
> cluster can choose to opt in to new features once the whole cluster has
> them available. From experience, having this in Riak made upgrades a
> significantly less risky process and also paved a path towards repeatable
> downgrades. I think Cassandra would benefit from it as well.
>
> Further, other tools like analytics could benefit from having this
> information since currently it's up to the operator to manually determine
> the state of the cluster in some cases.
>
> I am considering drafting a CEP proposal for this feature but wanted to
> take the general temperature of the community and get some early thoughts
> while working on the draft.
>
> Looking forward to hearing y'alls thoughts,
> Jordan
>
> [1]
> https://github.com/basho/riak_core/blob/25d9a6fa917eb8a2e95795d64eb88d7ad384ed88/src/riak_core_capability.erl#L23-L72
>
> [2] https://issues.apache.org/jira/browse/CASSANDRA-20118
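
To make my question above more concrete, here is a tiny, self-contained sketch of how I read the capabilities idea (all names are made up for this mail, nothing here is a proposed API): each node advertises what it supports, and the cluster-wide agreed set is the intersection across all nodes, so nothing new is enabled until the last node can do it:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of capability negotiation as I understand it: each node
// advertises which features it supports, and the "negotiated" cluster-wide set
// is the intersection across all nodes, so a feature is only opted into once
// every node has it available.
public final class CapabilityNegotiationSketch {

    // What one node would gossip/register about itself.
    record NodeCapabilities(String node, Set<String> supported) {}

    // Intersect everyone's advertised capabilities; anything missing on a
    // single node (e.g. one not yet upgraded) stays disabled cluster-wide.
    static Set<String> negotiate(List<NodeCapabilities> nodes) {
        Set<String> agreed = null;
        for (NodeCapabilities n : nodes) {
            if (agreed == null) agreed = new HashSet<>(n.supported());
            else agreed.retainAll(n.supported());
        }
        return agreed == null ? Set.of() : agreed;
    }

    public static void main(String[] args) {
        List<NodeCapabilities> cluster = List.of(
            new NodeCapabilities("10.0.0.1", Set.of("sstable_n", "sstable_o")),
            new NodeCapabilities("10.0.0.2", Set.of("sstable_n", "sstable_o")),
            // Node still on the old version: only supports the old format.
            new NodeCapabilities("10.0.0.3", Set.of("sstable_n")));

        // Prints [sstable_n]: the new format is not enabled until the last
        // node is upgraded and advertises it too.
        System.out.println(negotiate(cluster));
    }
}

If that reading is right, the main difference I see is direction: with TCM we would push one agreed configuration top-down to every node, while capabilities derive the agreed set bottom-up from what the nodes themselves advertise.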