Re: Paxos repairs in CEP-14

bened...@apache.org Sun, 05 Dec 2021 11:59:42 -0800

> What is hard to wrap my head around is how a
> given partition can participate in a successful Paxos transaction even if
> it might be completely unaware of the previous transaction to the same
> partition.


The partition is aware of every prior transaction, but no specific replica is 
necessarily aware of any given preceding transaction. A quorum of data (or data 
+ digest) responses is needed for the coordinator to assemble the prior state, 
in order to compute the new state.
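
To make the point above concrete, here is a minimal sketch of how a 
coordinator could assemble the prior state from a quorum of responses. It is 
an illustration only, not Cassandra's read path: the Cell type, the 
merge_responses helper and the replica contents are all hypothetical, and 
timestamp ties are ignored.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for a stored value and its write timestamp."""
    value: str
    timestamp: int


def quorum_size(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def merge_responses(responses: List[Dict[str, Cell]]) -> Dict[str, Cell]:
    """Keep, per column, the cell with the highest timestamp seen."""
    merged: Dict[str, Cell] = {}
    for response in responses:
        for column, cell in response.items():
            current = merged.get(column)
            if current is None or cell.timestamp > current.timestamp:
                merged[column] = cell
    return merged


# Assumed history: T1 (timestamp 1) reached replicas a and b,
# T2 (timestamp 2) reached replicas b and c. Replica a never saw T2.
REPLICAS = {
    "a": {"balance": Cell("100", 1)},
    "b": {"balance": Cell("80", 2)},
    "c": {"balance": Cell("80", 2)},
}

# Every possible quorum of replicas yields the same merged prior state.
QUORUM_RESULTS = [
    merge_responses([REPLICAS[name] for name in quorum])
    for quorum in combinations(REPLICAS, quorum_size(len(REPLICAS)))
]
```

In this sketch replica a alone would report a stale balance, yet every 
quorum that includes it also includes a replica that witnessed T2, so the 
merged state is the same whichever quorum responds.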

> I can in any event stream commit logs from a majority of replicas, merge 
> them, and such a merged log must contain all committed transactions to that 
> partition

Yes. This was essentially true prior to CEP-14, however. The only real 
difference is that “committed” includes those that have been acknowledged to 
clients with an asynchronous commit phase, i.e. if the commit consistency level 
is e.g. LOCAL_QUORUM, ONE or ANY, thereby reducing the number of necessary 
round-trips by one. Today any transaction that has completed the commit phase 
would be visible in the commit log, and if your commit is synchronous (i.e. 
uses the default QUORUM consistency for SERIAL writes) then a majority of 
commit logs will contain the transaction.
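
As a rough illustration of the synchronous (QUORUM commit) case, the sketch 
below merges per-replica logs and checks that any majority of them covers 
every committed transaction. The log layout, the integer ballots and the 
merge_logs helper are assumptions made for the example; real commit logs are 
not structured like this, and the asynchronous commit case is not modelled.

```python
from itertools import combinations
from typing import Dict, List, Tuple

LogEntry = Tuple[int, str]  # (hypothetical ballot, mutation description)


def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def merge_logs(logs: List[List[LogEntry]]) -> List[LogEntry]:
    """Union the entries of several logs, deduplicated and ordered by ballot."""
    merged: Dict[int, str] = {}
    for log in logs:
        for ballot, mutation in log:
            merged[ballot] = mutation
    return sorted(merged.items())


# Assumed history: each transaction was committed to some majority of the
# three replicas, but no single replica holds all three.
LOGS = {
    "r1": [(1, "x=1"), (2, "x=2")],
    "r2": [(1, "x=1"), (3, "x=3")],
    "r3": [(2, "x=2"), (3, "x=3")],
}
COMMITTED = {1, 2, 3}


def every_majority_is_complete(logs: Dict[str, List[LogEntry]]) -> bool:
    """True if merging the logs of any majority yields all committed ballots."""
    for subset in combinations(logs, majority(len(logs))):
        merged = merge_logs([logs[name] for name in subset])
        if {ballot for ballot, _ in merged} != COMMITTED:
            return False
    return True
```

The property holds because any two majorities intersect, so a transaction 
committed to one majority appears in at least one log of any other.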

While on the topic, something the community may want to consider is how we 
message CEP-14, and how it might affect default behaviour. Currently, by 
default, commit consistency will remain QUORUM after CEP-14. But once CEP-14 is 
fully enabled on a cluster we might want to default to LOCAL_QUORUM. We might 
also want to consider automatically enabling some elements of CEP-14, given 
that they fix various consistency issues as well as performance issues.

From: Henrik Ingo <henrik.i...@datastax.com>
Date: Sunday, 5 December 2021 at 19:24
To: dev@cassandra.apache.org <dev@cassandra.apache.org>
Subject: Re: Paxos repairs in CEP-14
On Sun, 5 Dec 2021, 18.40 bened...@apache.org, <bened...@apache.org> wrote:

> > And at the end of the repair, this lower bound is known and stored
> somewhere?
>
> Yes, there is a new system.paxos_repair_history table
>
> > Under good conditions, I assume the result of a paxos repair is that all
> nodes received all LWT transactions from all other replicas?
>
> All in progress LWTs are flushed, essentially. They are either completed
> or invalidated. So there is a synchronisation point for the range being
> repaired, but there is no impact on any completed transactions. So even if
> paxos repair successfully sync’d all in progress transactions to every
> node, there could still be some past transactions that were persisted only
> to a majority of nodes, and these will be invisible to the paxos repair
> mechanism.


Cool. This clarifies.


> There is no transaction log today in Cassandra to sync, so repair of the
> underlying data table is still the only way to guarantee data is
> synchronised to every node.
>

It's not the transaction log as such that I'm missing. (Or it is, but I
understand there isn't one.) What is hard to wrap my head around is how a
given partition can participate in a successful Paxos transaction even if
it might be completely unaware of the previous transaction to the same
partition. At least this is how I've understood this conversation?


> CEP-15 will change this, so that nodes will be fully consistent up to some
> logical timestamp, but CEP-14 does not change the underlying semantics of
> LWTs and Paxos in Cassandra.
>

Yes, looking forward to that. I just wanted to check whether CEP-14 would
possibly contain some per-partition version of the same ideas.

But even with everything you've explained, did I understand correctly that
(focusing on a single partition and only LWT writes...) I can in any event
stream commit logs from a majority of replicas, merge them, and such a
merged log must contain all committed transactions to that partition? (And
this should have nothing to do with the repair, then?)

Henrik



>
>
>
>
> From: Henrik Ingo <henrik.i...@datastax.com>
> Date: Sunday, 5 December 2021 at 11:45
> To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> Subject: Re: Paxos repairs in CEP-14
> On Sun, 5 Dec 2021, 1.45 bened...@apache.org, <bened...@apache.org> wrote:
>
> > > As the repair is only guaranteed for a majority of replicas, I assume I
> > can discover somewhere which replicas are up to date like this?
> >
> > I’m not quite sure what you mean. Do you mean which nodes have
> > participated in a paxos repair? This information isn’t maintained, but
> > anyway would not imply the node is up to date. A node participating in a
> > paxos repair ensures _a majority of other nodes_ are up-to-date with _its_
> > knowledge, give or take.
>
>
> Ah, thanks for clarifying. Indeed I was assuming the paxos repair happens
> the opposite way.
>
>
> > By performing this on a majority of nodes, we ensure a majority of replicas
> > has a lower bound on the knowledge of a majority, and we effectively
> > invalidate any in-progress operations on any minority that did not
> > participate.
>
>
> And at the end of the repair, this lower bound is known and stored
> somewhere?
>
>
> > > Do I understand correctly, that if I take a backup from such a replica,
> > it is guaranteed to contain the full state up to a certain timestamp t?
> >
> > No, you would need to also perform regular repair afterwards. If you
> > perform a regular repair, by default it will now be preceded by a paxos
> > repair (which is typically very quick), so this will in fact hold, but
> > paxos repair won’t enforce it.
>
>
> Ok, so I'm trying to understand this...
>
> At the end of a Paxos repair, it is guaranteed that each LWT transaction
> has arrived at a majority of replicas. However, it's still not guaranteed
> that any single node would contain all transactions, because it could have
> been in a minority partition for some transactions. Correct so far?
>
> Under good conditions, I assume the result of a paxos repair is that all
> nodes received all LWT transactions from all other replicas? If some node
> is unavailable, that same node will be missing a bunch of transactions that
> it didn't receive repairs for?
>
>
> I'm thinking through this as I type, but I guess where I'm going is: in the
> universe of possible future work, does there exist a not-too-complex
> modification to CEP-14 where:
>
> 1. Node 1 concludes that a majority of its replicas appear to be available,
> and does its best to send all of its repairs to all of the replicas in that
> majority set.
>
> 2. Node 2 is able to learn that Node 1 successfully sent all of its repair
> writes to this set, and makes an attempt to do the same. If there are
> replicas in the set that it can't reach, they can be subtracted from the
> set, but the set still needs to contain a majority of replicas in the end.
>
> 3. At the end of all nodes doing the above, we would be left with a
> majority set of nodes that are known to - each individually - contain all
> LWT transactions up to the timestamp t.
>
> 4. A benefit of 3: A node N is not in the above majority set. It can now
> repair itself by communicating with a single node from the majority set,
> and copy its transaction log up to timestamp t. After doing so, it can join
> the majority set, as it now contains all transactions up to t.
>
> 5. For a longer outage it may not be possible for node N to ever catch up
> by replaying a serial transaction log. (Including for the reason an old
> enough log may no longer be available.) In this case traditional streaming
> repair would still be used.
>
> Based on your first reply, I guess none of the above is strictly needed to
> achieve the use case I outlined (backup, point in time restore,
> streaming...). It seems I'm attracted by the potential for simplicity of a
> setup where traditional repair is only needed as a fallback option.
> (Ultimately it's needed to bootstrap empty nodes anyway, so it wouldn't go
> away.)
>
>
>
>
>
> > > Does the replica also end up with a complete and continuous log of all
> > writes until t? If not, does a merge of all logs in the majority contain a
> > complete log?
> >
> > A majority. There is also no log that gets replicated for LWTs in
> > Cassandra. There is only ever at most one transaction that is in flight
> > (and that may complete) and whose result has not been persisted to some
> > majority, for any key. Paxos repair + repair means the results of the
> > implied log are replicated to all participants.
>
>
> I understand that Cassandra's LWT replication isn't based on replicating a
> single log. However I'm interested to understand whether it would be
> possible to end up with such a log as an outcome of the Paxos
> replication/repair process, since such a log can have other uses.
>
> Even with all of the above, I'm still left wondering: does the repair
> process (with the above modification, say) result in a node having all
> writes that happened before t, or is it only guaranteed to have the most
> recent value for each primary key?
>
>
> Henrik
>
> >
> > From: Henrik Ingo <henrik.i...@datastax.com>
> > Date: Saturday, 4 December 2021 at 23:12
> > To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> > Subject: Paxos repairs in CEP-14
> > Could someone elaborate on this section
> >
> > ****
> >
> > *Paxos Repair*
> > We will introduce a new repair mechanism, that can be run with or without
> > regular repair. This mechanism will:
> >
> >    - Track, per-replica, transactions that have been witnessed as initiated
> >    but have not been seen to complete
> >    - For a majority of replicas complete (either by invalidating,
> >    completing, or witnessing something newer) all operations they have
> >    witnessed as incomplete prior to the initiation of repair
> >    - Globally invalidate all promises issued prior to the most recent paxos
> >    repair
> >
> > ****
> >
> > Specific questions:
> >
> > Assuming a table only using these LWTs
> >
> > * As the repair is only guaranteed for a majority of replicas, I assume I
> > can discover somewhere which replicas are up to date like this?
> >
> > * Do I understand correctly, that if I take a backup from such a replica,
> > it is guaranteed to contain the full state up to a certain timestamp t?
> > (And in addition may or may not contain mutations higher than t, which of
> > course could overwrite the value the same key had at t.)
> >
> > * Does the replica also end up with a complete and continuous log of all
> > writes until t? If not, does a merge of all logs in the majority contain a
> > complete log? In particular, I'm trying to parse the significance of "or
> > witnessing something newer"? (Use case for this last question could be
> > point in time restore, aka continuous backup, or also streaming writes to a
> > downstream system.)
> >
> > henrik
> > --
> >
> > Henrik Ingo
