Hi Jacek,

Thanks for the great questions; they certainly all relate to things we considered, so I hope I can answer them coherently!
> will it be a replica group explicitly associated with each event

Some (but not all) events are explicitly associated with a replica group (or, more accurately, with the superset of present and future replica groups). Where this is the case, acknowledgment of these events by a majority of the involved replicas is required for the "larger" process of which the event is a part to make progress. This is not a global watermark, though, which would halt progress across the cluster; it only affects the specific multistep operation it belongs to.

In such an operation (say, a bootstrap or decommission), only one actor is actually in control of moving forward through the process; all the other nodes in the cluster simply apply the relevant metadata updates locally. It is the progress of this primary actor which is gated on the acknowledgments.

In the bootstrap case, the joining node itself drives the process. The joining node submits the first event to the CMS, which is hopefully accepted (because it violates no invariants on the cluster metadata) and becomes committed. The joining node then awaits notification from the CMS that a majority of the relevant peers have acked the event. Until it receives that, it will not submit the event representing the next step in the operation. By the same mechanism, it will not perform other aspects of its bootstrap until the preceding metadata change is acked (i.e. it won't initiate streaming until the step which adds it to the write groups, making it a pending node, is acked).

Other metadata changes can certainly occur while this is going on. Another joining node may start submitting similar events, and as long as the operation is permitted, that process will progress concurrently. To ensure that these multistep operations are safe to execute concurrently, we reject submissions which affect ranges already affected by an in-flight operation.
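To make the gating mechanics concrete, here is a minimal Python sketch of the idea. All of the names here (`CMS`, `try_advance`, `RangeConflict`) are hypothetical illustrations, not the actual CEP-21 API:

```python
# Hypothetical sketch of ack-gated progress through a multistep operation.
# CMS, try_advance and RangeConflict are illustrative names, not CEP-21's API.

class RangeConflict(Exception):
    """Submission touches ranges already held by another in-flight operation."""

class CMS:
    """Toy stand-in for the cluster metadata service."""
    def __init__(self):
        self.log = []        # committed events, in commit order
        self.in_flight = {}  # op_id -> ranges locked by that operation
        self.acks = {}       # event -> replicas that have acked it

    def submit(self, op_id, event, ranges):
        # The first event of a new operation locks its ranges; overlapping
        # submissions from other operations are rejected outright.
        if op_id not in self.in_flight:
            for held in self.in_flight.values():
                if held & ranges:
                    raise RangeConflict(f"{held & ranges} already in flight")
            self.in_flight[op_id] = set(ranges)
        self.log.append(event)
        self.acks[event] = set()

    def ack(self, event, replica):
        self.acks[event].add(replica)

    def has_majority(self, event, replicas):
        return len(self.acks[event] & replicas) > len(replicas) // 2

def try_advance(cms, op_id, prev_event, next_event, ranges, replicas):
    """The driving node submits the next step only once the previous one
    has been acked by a majority of the relevant replicas."""
    if prev_event is None or cms.has_majority(prev_event, replicas):
        cms.submit(op_id, next_event, ranges)
        return True
    return False
```

In this toy model, a second bootstrap touching disjoint ranges is accepted concurrently, while one overlapping the first operation's ranges raises `RangeConflict` until that operation completes.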
Essentially, you can safely run concurrent bootstraps provided the nodes involved do not share replicated ranges.

> For multistep actions - are they going to be added all or none? If they are
> added one by one, can they be interleaved with other multistep actions?

As you can see from the above, the steps are committed to the log one at a time, and multistep operations can be interleaved. However, at the point of executing the first step, the plan for executing the rest of the steps is known. We persist this in-progress operation plan in the cluster metadata as an effect of executing the first step. Note that this is different from actually committing the steps to the log itself; the pending steps do not yet have an order assigned (and may never get one). This persistence of in-progress operations enables an operation to be resumed if the node driving it were to restart part way through.

> What if a node(s) failure prevents progress over the log? For example, we are
> unable to get a majority of nodes which process an event so we cannot move
> forward. We cannot remove those nodes though, because the removal will be
> later in the log and we cannot make progress.

It's the specific operation that's unable to make progress; other metadata updates can proceed. To make this concrete: you're trying to join a new node to the cluster, but are unable to because some affected replicas are down and so cannot acknowledge one of the steps. If the replicas are temporarily down, bringing them back up would be sufficient to resume the join. If they are permanently unavailable, then in order to preserve consistency you need to cancel the ongoing join, replace them, and restart the join from scratch. Cancelling an in-progress operation like a join is a matter of reverting the metadata changes made by any of the steps which have already been committed, including the persistence of the aforementioned pending steps.
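The persistence of the pending plan and the revert-on-cancel behaviour might be sketched like this. Again, this is purely illustrative: `ClusterMetadata`, `Step`, and the function names are assumptions for the sake of the example, not the real implementation:

```python
# Illustrative sketch: persisting an in-progress operation plan and
# cancelling it by reverting committed steps. Not the actual CEP-21 code.

class ClusterMetadata:
    def __init__(self):
        self.state = {}          # arbitrary metadata keys/values
        self.pending_plans = {}  # op_id -> remaining (still unordered) steps

class Step:
    """Each step knows how to apply its metadata change and how to revert it."""
    def __init__(self, key, value):
        self.key, self.value = key, value

    def apply(self, metadata):
        self.prev = metadata.state.get(self.key)
        metadata.state[self.key] = self.value

    def revert(self, metadata):
        if self.prev is None:
            metadata.state.pop(self.key, None)
        else:
            metadata.state[self.key] = self.prev

def start_operation(metadata, op_id, steps):
    """Executing the first step also persists the plan for the remaining
    steps, so the operation can resume if the driving node restarts."""
    first, rest = steps[0], steps[1:]
    first.apply(metadata)
    metadata.pending_plans[op_id] = list(rest)
    return [first]  # the committed prefix of the operation so far

def cancel_operation(metadata, op_id, committed):
    """Cancelling reverts the committed steps (in reverse order) and
    removes the persisted pending-step plan."""
    for step in reversed(committed):
        step.revert(metadata)
    metadata.pending_plans.pop(op_id, None)
```

The key property the sketch tries to show is that after a cancel, both the applied metadata changes and the persisted plan are gone, leaving the cluster metadata as if the operation had never started.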
In the proposal, we've suggested an operator should be involved in this, but that would be something trivial like running a nodetool command to submit the necessary event to the log. It may be possible to automate that, but I would prefer to omit it initially, not least to keep the scope manageable. Of course, nothing would preclude an external monitoring system from running the nodetool command if it's trusted to accurately detect such failures.

> This sounds like an implementation of everywhere replication strategy,
> doesn't it?

It does sound similar, but fear not, it isn't quite the same. The "everywhere" here is limited to the CMS nodes, which are only a small subset of the cluster. Essentially, it just means that all the (current) CMS members own and replicate the entire event log, so when a node joins the CMS it bootstraps the entirety of the current log state (note: this needn't be the full log history, just a snapshot plus subsequent entries).

> On 6 Sep 2022, at 15:24, Jacek Lewandowski <lewandowski.ja...@gmail.com> wrote:
>
> Hi Sam, this is a great idea and a really well described CEP!
>
> I have some questions, perhaps they reflect my weak understanding, but maybe
> you can answer:
> Is it going to work so that each node reads the log individually and try to
> catch up in a way that it applies a transition locally once the previous
> change is confirmed on the majority of the affected nodes, right? If so, will
> it be a replica group explicitly associated with each event (explicitly
> mentioned nodes which are affected by the change and a list of those which
> already applied the change, so that each node individually can make a
> decision whether to move forward?). If so, can the node skip a transformation
> which does not affect it and move forward thus making another change
> concurrently?
>
> What if a node(s) failure prevents progress over the log?
> For example, we are unable to get a majority of nodes which process an event
> so we cannot move forward. We cannot remove those nodes though, because the
> removal will be later in the log and we cannot make progress. I've read about
> manual intervention but maybe it can be avoided in some cases for example by
> adding no more than one pending event to the log?
>
> For multistep actions - are they going to be added all or none? If they are
> added one by one, can they be interleaved with other multistep actions?
>
> Reconfiguration itself occurs using the process that is analogous to
> "regular" bootstrap and also uses Paxos as a linearizability mechanism,
> except for there is no concept of "token" ownership in CMS; all CMS nodes own
> an entire range from MIN to MAX token. This means that during bootstrap, we
> do not have to split ranges, or have some nodes "lose" a part of the ring...
>
> This sounds like an implementation of everywhere replication strategy,
> doesn't it?
>
> - - -- --- ----- -------- -------------
> Jacek Lewandowski
>
> On Tue, Sep 6, 2022 at 9:19 AM Sam Tunnicliffe <s...@beobal.com <mailto:s...@beobal.com>> wrote:
>
> > On 5 Sep 2022, at 22:02, Henrik Ingo <henrik.i...@datastax.com <mailto:henrik.i...@datastax.com>> wrote:
> >
> > Mostly I just wanted to ack that at least someone read the doc (somewhat
> > superficially sure, but some parts with thought...)
>
> > Thanks, it's a lot to digest, so we appreciate that people are working
> > through it.
>
> >> One pre-feature that we would include in the preceding minor release is a
> >> node level switch to disable all operations that modify cluster metadata
> >> state. This would include schema changes as well as topology-altering
> >> events like move, decommission or (gossip-based) bootstrap and would be
> >> activated on all nodes for the duration of the major upgrade.
> >> If this switch were accessible via internode messaging, activating it for
> >> an upgrade could be automated. When an upgraded node starts up, it could
> >> send a request to disable metadata changes to any peer still running the
> >> old version. This would cost a few redundant messages, but simplify things
> >> operationally.
> >>
> >> Although this approach would necessitate an additional minor version
> >> upgrade, this is not without precedent and we believe that the benefits
> >> outweigh the costs of additional operational overhead.
>
> > Sounds like a great idea, and probably necessary in practice?
>
> > Although I think we _could_ manage without this, it would certainly
> > simplify this and future upgrades.
>
> >> If this part of the proposal is accepted, we could also include further
> >> messaging protocol changes in the minor release, as these would largely
> >> constitute additional verbs which would be implemented with no-op verb
> >> handlers initially. This would simplify the major version code, as it
> >> would not need to gate the sending of asynchronous replication messages on
> >> the receiver's release version. During the migration, it may be useful to
> >> have a way to directly inject gossip messages into the cluster, in case
> >> the states of the yet-to-be upgraded nodes become inconsistent. This isn't
> >> intended, so such a tool may never be required, but we have seen that
> >> gossip propagation can be difficult to reason about at times.
>
> > Others will know the code better and I understand that adding new no-op
> > verbs can be considered safe... But instinctively a bit hesitant on this
> > one. Surely adding a few if statements to the upgraded version isn't that
> > big of a deal?
> >
> > Also, it should make sense to minimize the dependencies from the previous
> > major version (without CEP-21) to the new major version (with CEP-21).
> > If a bug is found, it's much easier to fix code in the new major version
> > than the old and supposedly stable one.
>
> > Yep, agreed. Adding verb handlers in advance may not buy us very much, so
> > may not be worth the risk of additionally perturbing the stable system. I
> > would say that having a means to directly manipulate gossip state during
> > the upgrade would be a useful safety net in case something unforeseen
> > occurs and we need to dig ourselves out of a hole. The precise scope of the
> > feature & required changes are not something we've given extensive thought
> > to yet, so we'd want to assess that carefully before proceeding.
>
> > henrik
> >
> > --
> > Henrik Ingo
> > +358 40 569 7354