Hi Jacek,

Thanks for the great questions; they certainly all relate to things we considered, so I hope I can answer them coherently!
> will it be a replica group explicitly associated with each event

Some (but not all) events are explicitly associated with a replica group (or, more accurately, with the superset of present and future replica groups). Where this is the case, acknowledgment of these events by a majority of the involved replicas is required for the "larger" process of which the event is a part to make progress. This is not a global watermark, though, which would halt progress across the cluster; it only affects the specific multistep operation it belongs to.

In such an operation (say, a bootstrap or decommission), only one actor is actually in control of moving forward through the process; all the other nodes in the cluster simply apply the relevant metadata updates locally. It is the progress of this primary actor which is gated on the acknowledgments.

In the bootstrap case, the joining node itself drives the process. The joining node submits the first event to the CMS, which is hopefully accepted (because it violates no invariants on the cluster metadata) and becomes committed. The joining node then awaits notification from the CMS that a majority of the relevant peers have acked the event. Until it receives that, it will not submit the event representing the next step in the operation. By the same mechanism, it will not perform other aspects of its bootstrap until the preceding metadata change is acked (i.e. it won't initiate streaming until the step which adds it to the write groups, making it a pending node, is acked).

Other metadata changes can certainly occur while this is going on. Another joining node may start submitting similar events, and as long as the operation is permitted, that process will progress concurrently. To ensure that these multistep operations are safe to execute concurrently, we reject submissions which affect ranges already affected by an in-flight operation.
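To make the gating mechanics concrete, here is a minimal Python sketch of the idea. All of the names here (`CMS`, `try_advance`, `RangeConflict`) are hypothetical illustrations, not the actual CEP-21 API:

```python
# Hypothetical sketch of ack-gated progress through a multistep operation.
# CMS, try_advance and RangeConflict are illustrative names, not CEP-21's API.

class RangeConflict(Exception):
    """Submission touches ranges already held by another in-flight operation."""

class CMS:
    """Toy stand-in for the cluster metadata service."""
    def __init__(self):
        self.log = []        # committed events, in commit order
        self.in_flight = {}  # op_id -> ranges locked by that operation
        self.acks = {}       # event -> replicas that have acked it

    def submit(self, op_id, event, ranges):
        # The first event of a new operation locks its ranges; overlapping
        # submissions from other operations are rejected outright.
        if op_id not in self.in_flight:
            for held in self.in_flight.values():
                if held & ranges:
                    raise RangeConflict(f"{held & ranges} already in flight")
            self.in_flight[op_id] = set(ranges)
        self.log.append(event)
        self.acks[event] = set()

    def ack(self, event, replica):
        self.acks[event].add(replica)

    def has_majority(self, event, replicas):
        return len(self.acks[event] & replicas) > len(replicas) // 2

def try_advance(cms, op_id, prev_event, next_event, ranges, replicas):
    """The driving node submits the next step only once the previous one
    has been acked by a majority of the relevant replicas."""
    if prev_event is None or cms.has_majority(prev_event, replicas):
        cms.submit(op_id, next_event, ranges)
        return True
    return False
```

In this toy model, a second bootstrap touching disjoint ranges is accepted concurrently, while one overlapping the first operation's ranges raises `RangeConflict` until that operation completes.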
Essentially, you can safely run concurrent bootstraps provided the nodes involved do not share replicated ranges.

> For multistep actions - are they going to be added all or none? If they are
> added one by one, can they be interleaved with other multistep actions?

As you can see from the above, the steps are committed to the log one at a time, and multistep operations can be interleaved. However, at the point of executing the first step, the plan for executing the rest of the steps is known. We persist this in-progress operation plan in the cluster metadata as an effect of executing the first step. Note that this is different from actually committing the steps to the log itself; the pending steps do not yet have an order assigned (and may never get one). This persistence of in-progress operations enables an operation to be resumed if the node driving it were to restart part way through.

> What if a node(s) failure prevents progress over the log? For example, we are
> unable to get a majority of nodes which process an event so we cannot move
> forward. We cannot remove those nodes though, because the removal will be
> later in the log and we cannot make progress.

It's the specific operation that's unable to make progress; other metadata updates can proceed. To make this concrete: you're trying to join a new node to the cluster, but are unable to because some affected replicas are down and so cannot acknowledge one of the steps. If the replicas are temporarily down, bringing them back up would be sufficient to resume the join. If they are permanently unavailable, then in order to preserve consistency you need to cancel the ongoing join, replace them, and restart the join from scratch. Cancelling an in-progress operation like a join is a matter of reverting the metadata changes made by any of the steps which have already been committed, including the persistence of the aforementioned pending steps.
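The persistence of the pending plan and the revert-on-cancel behaviour might be sketched like this. Again, this is purely illustrative: `ClusterMetadata`, `Step`, and the function names are assumptions for the sake of the example, not the real implementation:

```python
# Illustrative sketch: persisting an in-progress operation plan and
# cancelling it by reverting committed steps. Not the actual CEP-21 code.

class ClusterMetadata:
    def __init__(self):
        self.state = {}          # arbitrary metadata keys/values
        self.pending_plans = {}  # op_id -> remaining (still unordered) steps

class Step:
    """Each step knows how to apply its metadata change and how to revert it."""
    def __init__(self, key, value):
        self.key, self.value = key, value

    def apply(self, metadata):
        self.prev = metadata.state.get(self.key)
        metadata.state[self.key] = self.value

    def revert(self, metadata):
        if self.prev is None:
            metadata.state.pop(self.key, None)
        else:
            metadata.state[self.key] = self.prev

def start_operation(metadata, op_id, steps):
    """Executing the first step also persists the plan for the remaining
    steps, so the operation can resume if the driving node restarts."""
    first, rest = steps[0], steps[1:]
    first.apply(metadata)
    metadata.pending_plans[op_id] = list(rest)
    return [first]  # the committed prefix of the operation so far

def cancel_operation(metadata, op_id, committed):
    """Cancelling reverts the committed steps (in reverse order) and
    removes the persisted pending-step plan."""
    for step in reversed(committed):
        step.revert(metadata)
    metadata.pending_plans.pop(op_id, None)
```

The key property the sketch tries to show is that after a cancel, both the applied metadata changes and the persisted plan are gone, leaving the cluster metadata as if the operation had never started.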
In the proposal, we've suggested an operator should be involved in this, but that would be something trivial like running a nodetool command to submit the necessary event to the log. It may be possible to automate that, but I would prefer to omit it initially, not least to keep the scope manageable. Of course, nothing would preclude an external monitoring system from running the nodetool command if it's trusted to accurately detect such failures.

> This sounds like an implementation of everywhere replication strategy,
> doesn't it?

It does sound similar, but fear not, it isn't quite the same. The "everywhere" here is limited to the CMS nodes, which are only a small subset of the cluster. Essentially, it just means that all the (current) CMS members own and replicate the entire event log, so when a node joins the CMS it bootstraps the entirety of the current log state (note: this needn't be the full log history, just a snapshot plus subsequent entries).

> On 6 Sep 2022, at 15:24, Jacek Lewandowski <lewandowski.ja...@gmail.com> wrote:
>
> Hi Sam, this is a great idea and a really well described CEP!
>
> I have some questions, perhaps they reflect my weak understanding, but maybe
> you can answer:
> Is it going to work so that each node reads the log individually and try to
> catch up in a way that it applies a transition locally once the previous
> change is confirmed on the majority of the affected nodes, right? If so, will
> it be a replica group explicitly associated with each event (explicitly
> mentioned nodes which are affected by the change and a list of those which
> already applied the change, so that each node individually can make a
> decision whether to move forward?). If so, can the node skip a transformation
> which does not affect it and move forward thus making another change
> concurrently?
>
> What if a node(s) failure prevents progress over the log?
> For example, we are unable to get a majority of nodes which process an event
> so we cannot move forward. We cannot remove those nodes though, because the
> removal will be later in the log and we cannot make progress. I've read about
> manual intervention but maybe it can be avoided in some cases for example by
> adding no more than one pending event to the log?
>
> For multistep actions - are they going to be added all or none? If they are
> added one by one, can they be interleaved with other multistep actions?
>
> Reconfiguration itself occurs using the process that is analogous to
> "regular" bootstrap and also uses Paxos as a linearizability mechanism,
> except for there is no concept of "token" ownership in CMS; all CMS nodes own
> an entire range from MIN to MAX token. This means that during bootstrap, we
> do not have to split ranges, or have some nodes "lose" a part of the ring...
>
> This sounds like an implementation of everywhere replication strategy,
> doesn't it?
>
> - - -- --- ----- -------- -------------
> Jacek Lewandowski
>
> On Tue, Sep 6, 2022 at 9:19 AM Sam Tunnicliffe <s...@beobal.com <mailto:s...@beobal.com>> wrote:
>
> > On 5 Sep 2022, at 22:02, Henrik Ingo <henrik.i...@datastax.com <mailto:henrik.i...@datastax.com>> wrote:
> >
> > Mostly I just wanted to ack that at least someone read the doc (somewhat
> > superficially sure, but some parts with thought...)
>
> > Thanks, it's a lot to digest, so we appreciate that people are working
> > through it.
>
> >> One pre-feature that we would include in the preceding minor release is a
> >> node level switch to disable all operations that modify cluster metadata
> >> state. This would include schema changes as well as topology-altering
> >> events like move, decommission or (gossip-based) bootstrap and would be
> >> activated on all nodes for the duration of the major upgrade.
> >> If this switch were accessible via internode messaging, activating it for
> >> an upgrade could be automated. When an upgraded node starts up, it could
> >> send a request to disable metadata changes to any peer still running the
> >> old version. This would cost a few redundant messages, but simplify things
> >> operationally.
> >>
> >> Although this approach would necessitate an additional minor version
> >> upgrade, this is not without precedent and we believe that the benefits
> >> outweigh the costs of additional operational overhead.
>
> > Sounds like a great idea, and probably necessary in practice?
>
> > Although I think we _could_ manage without this, it would certainly
> > simplify this and future upgrades.
>
> >> If this part of the proposal is accepted, we could also include further
> >> messaging protocol changes in the minor release, as these would largely
> >> constitute additional verbs which would be implemented with no-op verb
> >> handlers initially. This would simplify the major version code, as it
> >> would not need to gate the sending of asynchronous replication messages on
> >> the receiver's release version. During the migration, it may be useful to
> >> have a way to directly inject gossip messages into the cluster, in case
> >> the states of the yet-to-be upgraded nodes become inconsistent. This isn't
> >> intended, so such a tool may never be required, but we have seen that
> >> gossip propagation can be difficult to reason about at times.
>
> > Others will know the code better and I understand that adding new no-op
> > verbs can be considered safe... But instinctively a bit hesitant on this
> > one. Surely adding a few if statements to the upgraded version isn't that
> > big of a deal?
> >
> > Also, it should make sense to minimize the dependencies from the previous
> > major version (without CEP-21) to the new major version (with CEP-21).
> > If a bug is found, it's much easier to fix code in the new major version
> > than the old and supposedly stable one.
>
> > Yep, agreed. Adding verb handlers in advance may not buy us very much, so
> > may not be worth the risk of additionally perturbing the stable system. I
> > would say that having a means to directly manipulate gossip state during
> > the upgrade would be a useful safety net in case something unforeseen
> > occurs and we need to dig ourselves out of a hole. The precise scope of the
> > feature & required changes are not something we've given extensive thought
> > to yet, so we'd want to assess that carefully before proceeding.
>
> > henrik
> >
> > --
> > Henrik Ingo
> > +358 40 569 7354