> And at the end of the repair, this lower bound is known and stored
somewhere?

Yes, there is a new system.paxos_repair_history table

> Under good conditions, I assume the result of a paxos repair is that all
nodes received all LWT transactions from all other replicas?

All in-progress LWTs are flushed, essentially: they are either completed or
invalidated. So there is a synchronisation point for the range being repaired,
but there is no impact on any completed transactions. Even if paxos repair
successfully sync’d all in-progress transactions to every node, there could
still be some past transactions that were persisted only to a majority of
nodes, and these are invisible to the paxos repair mechanism. There is no
transaction log in Cassandra today to sync, so repair of the underlying data
table is still the only way to guarantee data is synchronised to every node.
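The semantics above can be sketched as a toy model (purely a hypothetical illustration, not Cassandra's actual code): paxos repair drains in-progress proposals at a sync point, while a past write that reached only a majority of data tables stays invisible to it and still needs data repair.

```python
# Toy model of the semantics described above (hypothetical names, not
# Cassandra internals): paxos repair completes or invalidates every
# in-progress proposal, but never touches already-committed data.

from dataclasses import dataclass, field

@dataclass
class Replica:
    committed: dict = field(default_factory=dict)   # key -> value (data table)
    in_progress: set = field(default_factory=set)   # accepted-but-incomplete ballots

def paxos_repair(replicas):
    """Sync point: an in-progress proposal witnessed by a majority is
    completed at the replicas that saw it; the rest are invalidated.
    Committed data is left untouched either way."""
    majority = len(replicas) // 2 + 1
    seen = {}
    for r in replicas:
        for ballot in r.in_progress:
            seen[ballot] = seen.get(ballot, 0) + 1
    for r in replicas:
        for ballot in list(r.in_progress):
            if seen[ballot] >= majority:
                r.committed[ballot] = "completed"   # finish the transaction
            # completed or invalidated, it is no longer in progress
            r.in_progress.discard(ballot)

# Three replicas; a past LWT on key "x" was committed to only a majority
# (r1, r2), and an in-progress proposal "b1" was witnessed by r1 and r3.
r1 = Replica(committed={"x": 1}, in_progress={"b1"})
r2 = Replica(committed={"x": 1})
r3 = Replica(committed={}, in_progress={"b1"})
paxos_repair([r1, r2, r3])

assert not any(r.in_progress for r in [r1, r2, r3])  # sync point reached
assert "x" not in r3.committed  # past majority-only write still missing:
                                # only data-table repair can fix this
```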

CEP-15 will change this, so that nodes will be fully consistent up to some 
logical timestamp, but CEP-14 does not change the underlying semantics of LWTs 
and Paxos in Cassandra.

From: Henrik Ingo <henrik.i...@datastax.com>
Date: Sunday, 5 December 2021 at 11:45
To: dev@cassandra.apache.org <dev@cassandra.apache.org>
Subject: Re: Paxos repairs in CEP-14
On Sun, 5 Dec 2021, 1.45 bened...@apache.org, <bened...@apache.org> wrote:

> > As the repair is only guaranteed for a majority of replicas, I assume I
> can discover somewhere which replicas are up to date like this?
>
> I’m not quite sure what you mean. Do you mean which nodes have
> participated in a paxos repair? This information isn’t maintained, but
> anyway would not imply the node is up to date. A node participating in a
> paxos repair ensures _a majority of other nodes_ are up-to-date with _its_
> knowledge, give or take.


Ah, thanks for clarifying. Indeed I was assuming the paxos repair happens
the opposite way.


> By performing this on a majority of nodes, we ensure a majority of replicas
> has a lower bound on the knowledge of a majority, and we effectively
> invalidate any in-progress operations on any minority that did not
> participate.


And at the end of the repair, this lower bound is known and stored
somewhere?


> > Do I understand correctly, that if I take a backup from such a replica,
> it is guaranteed to contain the full state up to a certain timestamp t?
>
> No, you would need to also perform regular repair afterwards. If you
> perform a regular repair, by default it will now be preceded by a paxos
> repair (which is typically very quick), so this will in fact hold, but
> paxos repair won’t enforce it.


Ok, so I'm trying to understand this...

At the end of a Paxos repair, it is guaranteed that each LWT transaction
has arrived at a majority of replicas. However, it's still not guaranteed
that any single node would contain all transactions, because it could have
been in a minority partition for some transactions. Correct so far?

Under good conditions, I assume the result of a paxos repair is that all
nodes received all LWT transactions from all other replicas? If some node
is unavailable, that same node will be missing a bunch of transactions that
it didn't receive repairs for?


I'm thinking through this as I type, but I guess where I'm going is: in the
universe of possible future work, does there exist a not-too-complex
modification to CEP-14 where:

1. Node 1 concludes that a majority of its replicas appear to be available,
and does its best to send all of its repairs to all of the replicas in that
majority set.

2. Node 2 is able to learn that Node 1 successfully sent all of its repair
writes to this set, and makes an attempt to do the same. If there are
replicas in the set that it can't reach, they can be subtracted from the
set, but the set still needs to contain a majority of replicas in the end.

3. At the end of all nodes doing the above, we would be left with a
majority set of nodes that are known to - each individually - contain all
LWT transactions up to the timestamp t.

4. A benefit of 3: A node N is not in the above majority set. It can now
repair itself by communicating with a single node from the majority set,
and copy its transaction log up to timestamp t. After doing so, it can join
the majority set, as it now contains all transactions up to t.

5. For a longer outage it may not be possible for node N to ever catch up
by replaying a serial transaction log (including because a sufficiently old
log may no longer be available). In this case traditional streaming repair
would still be used.
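Steps 1-3 above can be sketched as a toy simulation (entirely hypothetical, not part of CEP-14 or Cassandra) to check the reasoning that the surviving set ends up individually complete:

```python
# Hypothetical sketch of the proposed modification: every node pushes all
# of its known transactions to each member of a shared majority set;
# unreachable members may be dropped as long as a majority remains.

def repair_round(nodes, reachable, n_replicas):
    """nodes: dict name -> set of known transaction ids.
       reachable: names of currently reachable nodes."""
    majority = n_replicas // 2 + 1
    target = set(reachable)                 # step 1: candidate majority set
    for name in nodes:                      # step 2: each node repairs in turn
        sendable = target & reachable
        if len(sendable) < majority:
            raise RuntimeError("cannot form a majority set")
        target = sendable                   # shrink the set if members dropped
        for member in target:
            nodes[member] |= nodes[name]    # push all of this node's txns
    return target                           # step 3: known-complete majority

nodes = {"n1": {1, 2}, "n2": {2, 3}, "n3": {1, 3}, "n4": set(), "n5": {4}}
complete = repair_round(nodes, reachable={"n1", "n2", "n3", "n5"}, n_replicas=5)

all_txns = {1, 2, 3, 4}
assert all(nodes[m] == all_txns for m in complete)  # each member has everything

# Step 4: a node outside the set catches up from any single member.
nodes["n4"] |= nodes[next(iter(complete))]
assert nodes["n4"] == all_txns
```

Since every reachable node pushes its full set to every member, each transaction reaches every member via its original holder, so the result holds regardless of the order nodes repair in.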

Based on your first reply, I guess none of the above is strictly needed to
achieve the use case I outlined (backup, point in time restore,
streaming...). It seems I'm attracted by the potential for simplicity of a
setup where traditional repair is only needed as a fallback option.
(Ultimately it's needed to bootstrap empty nodes anyway, so it wouldn't go
away.)

> > Does the replica also end up with a complete and continuous log of all
> writes until t? If not, does a merge of all logs in the majority contain a
> complete log?
>
> A majority. There is also no log that gets replicated for LWTs in
> Cassandra. There is only ever at most one transaction that is in flight
> (and that may complete) and whose result has not been persisted to some
> majority, for any key. Paxos repair + repair means the result of the
> implied log is replicated to all participants.


I understand that Cassandra's LWT replication isn't based on replicating a
single log. However I'm interested to understand whether it would be
possible to end up with such a log as an outcome of the Paxos
replication/repair process, since such a log can have other uses.

Even with all of the above, I'm still left wondering: does the repair
process (with the above modification, say) result in a node having all
writes that happened before t, or is it only guaranteed to have the most
recent value for each primary key?


Henrik

>
> From: Henrik Ingo <henrik.i...@datastax.com>
> Date: Saturday, 4 December 2021 at 23:12
> To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> Subject: Paxos repairs in CEP-14
> Could someone elaborate on this section
>
> ****
>
> *Paxos Repair*
> We will introduce a new repair mechanism, that can be run with or without
> regular repair. This mechanism will:
>
>    - Track, per-replica, transactions that have been witnessed as initiated
>    but have not been seen to complete
>    - For a majority of replicas complete (either by invalidating,
>    completing, or witnessing something newer) all operations they have
>    witnessed as incomplete prior to the initiation of repair
>    - Globally invalidate all promises issued prior to the most recent paxos
>    repair
>
> ****
>
> Specific questions:
>
> Assuming a table only using these LWTs:
>
> * As the repair is only guaranteed for a majority of replicas, I assume I
> can discover somewhere which replicas are up to date like this?
>
> * Do I understand correctly, that if I take a backup from such a replica,
> it is guaranteed to contain the full state up to a certain timestamp t?
> (And in addition may or may not contain mutations higher than t, which of
> course could overwrite the value the same key had at t.)
>
> * Does the replica also end up with a complete and continuous log of all
> writes until t? If not, does a merge of all logs in the majority contain a
> complete log? In particular, I'm trying to parse the significance of "or
> witnessing something newer"? (Use case for this last question could be
> point in time restore, aka continuous backup, or also streaming writes to a
> downstream system.)
>
> henrik
> --
>
> Henrik Ingo
>
> +358 40 569 7354
>
>
