Re: Paxos repairs in CEP-14

2021-12-05 Thread Henrik Ingo
On Sun, 5 Dec 2021, 18.40 bened...@apache.org,  wrote:

> > And at the end of the repair, this lower bound is known and stored
> > somewhere?
>
> Yes, there is a new system.paxos_repair_history table
>
> > Under good conditions, I assume the result of a paxos repair is that all
> > nodes received all LWT transactions from all other replicas?
>
> All in progress LWTs are flushed, essentially. They are either completed
> or invalidated. So there is a synchronisation point for the range being
> repaired, but there is no impact on any completed transactions. So even if
> paxos repair successfully sync’d all in progress transactions to every
> node, there could still be some past transactions that were persisted only
> to a majority of nodes, and these will be invisible to the paxos repair
> mechanism.


Cool. This clarifies.


> There is no transaction log today in Cassandra to sync, so repair of the
> underlying data table is still the only way to guarantee data is
> synchronised to every node.
>

It's not the transaction log as such that I'm missing. (Or it is, but I
understand there isn't one.) What is hard to wrap my head around is how a
given partition can participate in a successful Paxos transaction even if
it might be completely unaware of the previous transaction to the same
partition. At least this is how I've understood this conversation?


> CEP-15 will change this, so that nodes will be fully consistent up to some
> logical timestamp, but CEP-14 does not change the underlying semantics of
> LWTs and Paxos in Cassandra.
>

Yes, looking forward to that. I just wanted to check whether CEP-14 would
possibly contain some per-partition version of the same ideas.

But even with everything you've explained, did I understand correctly that
(focusing on a single partition and only LWT writes...) I can in any event
stream commit logs from a majority of replicas, merge them, and such a
merged log must contain all committed transactions to that partition? (And
this should have nothing to do with the repair, then?)
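
The merge described above can be illustrated with a minimal Python sketch.
Everything in it is hypothetical: Cassandra keeps no per-partition LWT log,
ballots are modelled as plain integers, and merge_logs and
every_majority_sees_all are illustrative helpers, not Cassandra APIs. The
only assumption is that every committed transaction reached some majority of
replicas, so any majority of logs overlaps it in at least one replica.

```python
from itertools import combinations

def merge_logs(logs):
    """Union the per-replica logs and order the entries by ballot."""
    merged = {}
    for log in logs:
        for ballot, value in log:
            merged[ballot] = value
    return sorted(merged.items())

# Hypothetical data: three replicas, each committed transaction
# was persisted on exactly two of them (a majority of three).
replica_logs = {
    "A": [(1, "x=1"), (2, "x=2")],
    "B": [(1, "x=1"), (3, "x=3")],
    "C": [(2, "x=2"), (3, "x=3")],
}
committed = [(1, "x=1"), (2, "x=2"), (3, "x=3")]

def every_majority_sees_all(replica_logs, committed):
    """Check that merging the logs of any majority yields every commit."""
    quorum = len(replica_logs) // 2 + 1
    return all(
        merge_logs(replica_logs[r] for r in group) == sorted(committed)
        for group in combinations(replica_logs, quorum)
    )
```

In this toy data no single replica holds all three transactions, yet the
merge of any two replicas does, which is the quorum-intersection argument
behind the question.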

Henrik




Re: Paxos repairs in CEP-14

2021-12-05 Thread Henrik Ingo
On Sun, 5 Dec 2021, 1.45 bened...@apache.org,  wrote:

> > As the repair is only guaranteed for a majority of replicas, I assume I
> > can discover somewhere which replicas are up to date like this?
>
> I’m not quite sure what you mean. Do you mean which nodes have
> participated in a paxos repair? This information isn’t maintained, but
> anyway would not imply the node is up to date. A node participating in a
> paxos repair ensures _a majority of other nodes_ are up-to-date with _its_
> knowledge, give or take.


Ah, thanks for clarifying. Indeed I was assuming the paxos repair happens
the opposite way.


> By performing this on a majority of nodes, we ensure a majority of replicas
> has a lower bound on the knowledge of a majority, and we effectively
> invalidate any in-progress operations on any minority that did not
> participate.


And at the end of the repair, this lower bound is known and stored
somewhere?


> > Do I understand correctly, that if I take a backup from such a replica,
> > it is guaranteed to contain the full state up to a certain timestamp t?
>
> No, you would need to also perform regular repair afterwards. If you
> perform a regular repair, by default it will now be preceded by a paxos
> repair (which is typically very quick), so this will in fact hold, but
> paxos repair won’t enforce it.


Ok, so I'm trying to understand this...

At the end of a Paxos repair, it is guaranteed that each LWT transaction
has arrived at a majority of replicas. However, it's still not guaranteed
that any single node would contain all transactions, because it could have
been in a minority partition for some transactions. Correct so far?

Under good conditions, I assume the result of a paxos repair is that all
nodes received all LWT transactions from all other replicas? If some node
is unavailable, that same node will be missing a bunch of transactions that
it didn't receive repairs for?


I'm thinking through this as I type, but I guess where I'm going is: in the
universe of possible future work, does there exist a not-too-complex
modification to CEP-14 where:

1. Node 1 concludes that a majority of its replicas appear to be available,
and does its best to send all of its repairs to all of the replicas in that
majority set.

2. Node 2 is able to learn that Node 1 successfully sent all of its repair
writes to this set, and makes an attempt to do the same. If there are
replicas in the set that it can't reach, they can be subtracted from the
set, but the set still needs to contain a majority of replicas in the end.

3. At the end of all nodes doing the above, we would be left with a
majority set of nodes that are known to - each individually - contain all
LWT transactions up to the timestamp t.

4. A benefit of 3: A node N is not in the above majority set. It can now
repair itself by communicating with a single node from the majority set,
and copy its transaction log up to timestamp t. After doing so, it can join
the majority set, as it now contains all transactions up to t.

5. For a longer outage it may not be possible for node N to ever catch up
by replaying a serial transaction log. (For one thing, an old enough log
may no longer be available.) In this case traditional streaming repair
would still be used.
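
Steps 1 to 4 above can be sketched roughly as follows in Python. This is only
an illustration of the proposed modification, not anything CEP-14 specifies:
build_majority_set, catch_up, the reachable map and the known map are all
hypothetical names, and transactions are modelled as bare ids in a set, which
sidesteps the question of whether a node keeps full history or only the
latest value per key.

```python
def build_majority_set(replicas, participants, reachable, known):
    """Steps 1-3: each participant pushes what it knows to the candidate set.

    replicas:     all replica names for the range
    participants: replicas that are up and take a turn sending
    reachable:    sender -> set of replicas that sender can reach
    known:        replica -> set of transaction ids it holds (mutated)
    """
    quorum = len(replicas) // 2 + 1
    if len(participants) < quorum:
        raise RuntimeError("fewer than a majority of replicas participated")
    candidates = set(replicas)
    for sender in participants:
        # Subtract replicas this sender cannot reach (step 2).
        candidates &= reachable[sender] | {sender}
        if len(candidates) < quorum:
            raise RuntimeError("candidate set fell below a majority")
        # Send everything the sender knows to the remaining candidates.
        for target in candidates:
            known[target] |= known[sender]
    return candidates

def catch_up(node, member, known):
    """Step 4: a node outside the set copies from one member of the set."""
    known[node] |= known[member]

# Hypothetical scenario: E is down, and B cannot reach D.
replicas = ["A", "B", "C", "D", "E"]
known = {"A": {1, 2}, "B": {1, 2, 3}, "C": {1, 3}, "D": {2, 3}, "E": {1, 2}}
reachable = {
    "A": {"A", "B", "C", "D"},
    "B": {"A", "B", "C"},
    "C": {"A", "B", "C", "D"},
    "D": {"A", "B", "C", "D"},
}
majority_set = build_majority_set(replicas, ["A", "B", "C", "D"], reachable, known)
catch_up("E", "A", known)
```

In this scenario the set shrinks to A, B and C, each of which ends up holding
transactions 1, 2 and 3, and E can then catch up from A alone. Whether such a
set can be tracked cheaply in a real cluster is exactly the open question.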

Based on your first reply, I guess none of the above is strictly needed to
achieve the use case I outlined (backup, point in time restore,
streaming...). It seems I'm attracted by the potential for simplicity of a
setup where traditional repair is only needed as a fallback option.
(Ultimately it's needed to bootstrap empty nodes anyway, so it wouldn't go
away.)





> > Does the replica also end up with a complete and continuous log of all
> > writes until t? If not, does a merge of all logs in the majority contain a
> > complete log?
>
> A majority. There is also no log that gets replicated for LWTs in
> Cassandra. There is only ever at most one transaction that is in flight
> (and that may complete) and whose result has not been persisted to some
> majority, for any key. Paxos repair + repair means the result of the
> implied log are replicated to all participants.


I understand that Cassandra's LWT replication isn't based on replicating a
single log. However I'm interested to understand whether it would be
possible to end up with such a log as an outcome of the Paxos
replication/repair process, since such a log can have other uses.

Even with all of the above, I'm still left wondering: does the repair
process (with the above modification, say) result in a node having all
writes that happened before t, or is it only guaranteed to have the most
recent value for each primary key?


Henrik

>
> From: Henrik Ingo 
> Date: Saturday, 4 December 2021 at 23:12
> To: dev@cassandra.apache.org 
> Subject: Paxos repairs in CEP-14
> Could someone elaborate on this section
>
> 
>
> *Paxos Repair*
> We will introduce a new repair mechanism, that can be run with or without
> regular repair. This mechanism