Re: Paxos repairs in CEP-14

2021-12-05 Thread bened...@apache.org
> What is hard to wrap my head around is how a
> given partition can participate in a successful Paxos transaction even if
> it might be completely unaware of the previous transaction to the same
> partition.

The partition is aware of every prior transaction, but no specific replica is 
necessarily aware of any given preceding transaction. A quorum of data (or data 
+ digest) responses is needed for the coordinator to assemble the prior state, 
in order to compute the new state.
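
To make the quorum argument concrete, here is a minimal Python sketch. It is not Cassandra code: the names `state`, `commit` and `read_quorum` are made up for illustration, and a single (ballot, value) pair per key stands in for real replica state. Each commit lands on some majority only, yet a coordinator that merges any majority of responses by highest ballot still sees the latest committed value, because any two majorities of the same replica set overlap.

```python
from itertools import combinations

REPLICAS = ["r1", "r2", "r3"]
MAJORITY = len(REPLICAS) // 2 + 1

# Hypothetical per-replica state: key -> (ballot, value).
state = {r: {} for r in REPLICAS}

def commit(ballot, key, value, acked_by):
    """Persist a committed value on the replicas that acknowledged it."""
    assert len(acked_by) >= MAJORITY, "a commit needs a majority"
    for r in acked_by:
        state[r][key] = (ballot, value)

def read_quorum(key, contacted):
    """Coordinator view: merge the responses, keeping the highest ballot."""
    assert len(contacted) >= MAJORITY
    responses = [state[r][key] for r in contacted if key in state[r]]
    return max(responses) if responses else None

commit(1, "k", "a", ["r1", "r2"])  # r3 never sees ballot 1
commit(2, "k", "b", ["r2", "r3"])  # r1 never sees ballot 2

# Whichever majority is contacted, the latest commit is visible.
results = {q: read_quorum("k", q) for q in combinations(REPLICAS, MAJORITY)}
for quorum, latest in results.items():
    print(quorum, "->", latest)
```

Note that r1 on its own still holds the stale ballot 1, which is the "no specific replica is necessarily aware" part of the statement above.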

> I can in any event stream commit logs from a majority of replicas, merge 
> them, and such a merged log must contain all committed transactions to that 
> partition

Yes. This was essentially true prior to CEP-14, however. The only real 
difference is that “committed” includes those that have been acknowledged to 
clients with an asynchronous commit phase, i.e. if the commit consistency level 
is e.g. LOCAL_QUORUM, ONE or ANY, thereby reducing the number of necessary 
round-trips by one. Today any transaction that has completed the commit phase 
would be visible in the commit log, and if your commit is synchronous (i.e. 
uses the default QUORUM consistency for SERIAL writes) then a majority of 
commit logs will contain the transaction.
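
A rough Python sketch of the merge described above, assuming synchronous commit so that every transaction sits in a majority of logs. The `logs` structure and `merge_logs` helper are hypothetical and do not reflect Cassandra's commit log format. With an asynchronous commit phase the assumption may not hold at the moment of acknowledgement.

```python
from itertools import combinations

# Hypothetical per-replica logs: lists of (ballot, mutation) entries.
# Every entry is present on exactly two of the three replicas.
logs = {
    "r1": [(1, "insert k=a"), (3, "update k=c")],
    "r2": [(1, "insert k=a"), (2, "update k=b")],
    "r3": [(2, "update k=b"), (3, "update k=c")],
}
MAJORITY = len(logs) // 2 + 1

def merge_logs(replicas):
    """Union the entries from the given replicas and order them by ballot."""
    merged = set()
    for r in replicas:
        merged.update(logs[r])
    return sorted(merged)

all_committed = merge_logs(logs)  # ground truth: union over every replica
merged_by_quorum = {q: merge_logs(q) for q in combinations(logs, MAJORITY)}
for quorum, merged in merged_by_quorum.items():
    print(quorum, [ballot for ballot, _ in merged])
```

In this toy data no single log is complete, but the merge of any two is.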

While on the topic, something the community may want to consider is how we 
message CEP-14, and how it might affect default behaviour. Currently, by 
default, commit consistency will remain QUORUM after CEP-14. But once CEP-14 is 
fully enabled on a cluster we might want to default to LOCAL_QUORUM. We might 
also want to consider whether to automatically enable any elements of CEP-14, 
given that they fix various consistency issues as well as performance issues.

From: Henrik Ingo 
Date: Sunday, 5 December 2021 at 19:24
To: dev@cassandra.apache.org 
Subject: Re: Paxos repairs in CEP-14

Re: Paxos repairs in CEP-14

2021-12-05 Thread Henrik Ingo
On Sun, 5 Dec 2021, 18.40 bened...@apache.org,  wrote:

> > And at the end of the repair, this lower bound is known and stored
> > somewhere?
>
> Yes, there is a new system.paxos_repair_history table
>
> > Under good conditions, I assume the result of a paxos repair is that all
> > nodes received all LWT transactions from all other replicas?
>
> All in progress LWTs are flushed, essentially. They are either completed
> or invalidated. So there is a synchronisation point for the range being
> repaired, but there is no impact on any completed transactions. So even if
> paxos repair successfully sync’d all in progress transactions to every
> node, there could still be some past transactions that were persisted only
> to a majority of nodes, and these will be invisible to the paxos repair
> mechanism.


Cool. This clarifies.


> There is no transaction log today in Cassandra to sync, so repair of the
> underlying data table is still the only way to guarantee data is
> synchronised to every node.
>

It's not the transaction log as such that I'm missing. (Or it is, but I
understand there isn't one.) What is hard to wrap my head around is how a
given partition can participate in a successful Paxos transaction even if
it might be completely unaware of the previous transaction to the same
partition. At least this is how I've understood this conversation?


> CEP-15 will change this, so that nodes will be fully consistent up to some
> logical timestamp, but CEP-14 does not change the underlying semantics of
> LWTs and Paxos in Cassandra.
>

Yes, looking forward to that. I just wanted to check whether CEP-14 would
possibly contain some per partition version of the same ideas.

But even with everything you've explained, did I understand correctly that
(focusing on a single partition and only LWT writes...) I can in any event
stream commit logs from a majority of replicas, merge them, and such a
merged log must contain all committed transactions to that partition. (And
this should have nothing to do with the repair, then?)

Henrik



Re: Paxos repairs in CEP-14

2021-12-05 Thread bened...@apache.org
> And at the end of the repair, this lower bound is known and stored
> somewhere?

Yes, there is a new system.paxos_repair_history table

> Under good conditions, I assume the result of a paxos repair is that all
> nodes received all LWT transactions from all other replicas?

All in progress LWTs are flushed, essentially. They are either completed or 
invalidated. So there is a synchronisation point for the range being repaired, 
but there is no impact on any completed transactions. So even if paxos repair 
successfully sync’d all in progress transactions to every node, there could 
still be some past transactions that were persisted only to a majority of 
nodes, and these will be invisible to the paxos repair mechanism. There is no 
transaction log today in Cassandra to sync, so repair of the underlying data 
table is still the only way to guarantee data is synchronised to every node.

CEP-15 will change this, so that nodes will be fully consistent up to some 
logical timestamp, but CEP-14 does not change the underlying semantics of LWTs 
and Paxos in Cassandra.

From: Henrik Ingo 
Date: Sunday, 5 December 2021 at 11:45
To: dev@cassandra.apache.org 
Subject: Re: Paxos repairs in CEP-14

Re: Paxos repairs in CEP-14

2021-12-05 Thread Henrik Ingo
On Sun, 5 Dec 2021, 1.45 bened...@apache.org,  wrote:

> > As the repair is only guaranteed for a majority of replicas, I assume I
> > can discover somewhere which replicas are up to date like this?
>
> I’m not quite sure what you mean. Do you mean which nodes have
> participated in a paxos repair? This information isn’t maintained, but
> anyway would not imply the node is up to date. A node participating in a
> paxos repair ensures _a majority of other nodes_ are up-to-date with _its_
> knowledge, give or take.


Ah, thanks for clarifying. Indeed I was assuming the paxos repair happens
the opposite way.


> By performing this on a majority of nodes, we ensure a majority of replicas
> has a lower bound on the knowledge of a majority, and we effectively
> invalidate any in-progress operations on any minority that did not
> participate.


And at the end of the repair, this lower bound is known and stored
somewhere?


> > Do I understand correctly, that if I take a backup from such a replica,
> > it is guaranteed to contain the full state up to a certain timestamp t?
>
> No, you would need to also perform regular repair afterwards. If you
> perform a regular repair, by default it will now be preceded by a paxos
> repair (which is typically very quick), so this will in fact hold, but
> paxos repair won’t enforce it.


Ok, so I'm trying to understand this...

At the end of a Paxos repair, it is guaranteed that each LWT transaction
has arrived at a majority of replicas. However, it's still not guaranteed
that any single node would contain all transactions, because it could have
been in a minority partition for some transactions. Correct so far?

Under good conditions, I assume the result of a paxos repair is that all
nodes received all LWT transactions from all other replicas? If some node
is unavailable, that same node will be missing a bunch of transactions that
it didn't receive repairs for?


I'm thinking through this as I type, but I guess where I'm going is: in the
universe of possible future work, does there exist a not-too-complex
modification to CEP-14 where:

1. Node 1 concludes that a majority of its replicas appear to be available,
and does its best to send all of its repairs to all of the replicas in that
majority set.

2. Node 2 is able to learn that Node 1 successfully sent all of its repair
writes to this set, and makes an attempt to do the same. If there are
replicas in the set that it can't reach, they can be subtracted from the
set, but the set still needs to contain a majority of replicas in the end.

3. At the end of all nodes doing the above, we would be left with a
majority set of nodes that are known to - each individually - contain all
LWT transactions up to the timestamp t.

4. A benefit of 3: A node N is not in the above majority set. It can now
repair itself by communicating with a single node from the majority set,
and copy its transaction log up to timestamp t. After doing so, it can join
the majority set, as it now contains all transactions up to t.

5. For a longer outage it may not be possible for node N to ever catch up
by replaying a serial transaction log. (Including for the reason an old
enough log may no longer be available.) In this case traditional streaming
repair would still be used.
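
Steps 1 to 4 above could be sketched roughly as follows. This is only an illustration of the hypothetical modification, not anything CEP-14 specifies. The function names, the `participants` argument (the nodes that are up and push their repairs) and the `reachable` map are all assumptions introduced here.

```python
def build_majority_set(replicas, participants, reachable):
    """Steps 1-3: shrink the candidate set to the replicas that every
    participating node managed to reach; give up below a majority."""
    majority = len(replicas) // 2 + 1
    candidates = set(replicas)
    for node in participants:
        # A node can always "reach" itself; unreachable replicas are dropped.
        candidates &= reachable[node] | {node}
        if len(candidates) < majority:
            return None
    return candidates

def catch_up(node_log, source_log, t):
    """Step 4: copy every entry with timestamp <= t from one member of
    the majority set. Entries are (timestamp, mutation) tuples."""
    return sorted(set(node_log) | {e for e in source_log if e[0] <= t})

replicas = ["n1", "n2", "n3"]
reachable = {"n1": {"n2"}, "n2": {"n1"}, "n3": set()}

# n3 is down, so only n1 and n2 push their repairs.
majority_set = build_majority_set(replicas, ["n1", "n2"], reachable)
print(majority_set)

# Later, n3 catches up from a single member of the set, up to t = 3.
n3_log = catch_up([(1, "a")], [(1, "a"), (2, "b"), (5, "e")], t=3)
print(n3_log)
```

Step 5 (falling back to traditional streaming repair) is not modelled.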

Based on your first reply, I guess none of the above is strictly needed to
achieve the use case I outlined (backup, point in time restore,
streaming...). It seems I'm attracted by the potential for simplicity of a
setup where traditional repair is only needed as a fallback option.
(Ultimately it's needed to bootstrap empty nodes anyway, so it wouldn't go
away.)





> > Does the replica also end up with a complete and continuous log of all
> > writes until t? If not, does a merge of all logs in the majority contain a
> > complete log?
>
> A majority. There is also no log that gets replicated for LWTs in
> Cassandra. There is only ever at most one transaction that is in flight
> (and that may complete) and whose result has not been persisted to some
> majority, for any key. Paxos repair + repair means the results of the
> implied log are replicated to all participants.


I understand that Cassandra's LWT replication isn't based on replicating a
single log. However I'm interested to understand whether it would be
possible to end up with such a log as an outcome of the Paxos
replication/repair process, since such a log can have other uses.

Even with all of the above, I'm still left wondering: does the repair
process (with the above modification, say) result in a node having all
writes that happened before t, or is it only guaranteed to have the most
recent value for each primary key?


Henrik


Re: Paxos repairs in CEP-14

2021-12-04 Thread bened...@apache.org
> As the repair is only guaranteed for a majority of replicas, I assume I
> can discover somewhere which replicas are up to date like this?

I’m not quite sure what you mean. Do you mean which nodes have participated in 
a paxos repair? This information isn’t maintained, but anyway would not imply 
the node is up to date. A node participating in a paxos repair ensures _a 
majority of other nodes_ are up-to-date with _its_ knowledge, give or take. By 
performing this on a majority of nodes, we ensure a majority of replicas has a 
lower bound on the knowledge of a majority, and we effectively invalidate any 
in-progress operations on any minority that did not participate.

> Do I understand correctly, that if I take a backup from such a replica,
> it is guaranteed to contain the full state up to a certain timestamp t?

No, you would need to also perform regular repair afterwards. If you perform a 
regular repair, by default it will now be preceded by a paxos repair (which is 
typically very quick), so this will in fact hold, but paxos repair won’t 
enforce it.

> Does the replica also end up with a complete and continuous log of all
> writes until t? If not, does a merge of all logs in the majority contain a
> complete log?

A majority. There is also no log that gets replicated for LWTs in Cassandra. 
There is only ever at most one transaction that is in flight (and that may 
complete) and whose result has not been persisted to some majority, for any 
key. Paxos repair + repair means the results of the implied log are replicated 
to all participants.

From: Henrik Ingo 
Date: Saturday, 4 December 2021 at 23:12
To: dev@cassandra.apache.org 
Subject: Paxos repairs in CEP-14
Could someone elaborate on this section



*Paxos Repair*
We will introduce a new repair mechanism, that can be run with or without
regular repair. This mechanism will:

   - Track, per-replica, transactions that have been witnessed as initiated
   but have not been seen to complete
   - For a majority of replicas complete (either by invalidating,
   completing, or witnessing something newer) all operations they have
   witnessed as incomplete prior to the initiation of repair
   - Globally invalidate all promises issued prior to the most recent paxos
   repair
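
The three bullets above can be read as the following toy model. It is a sketch under assumptions, not the CEP-14 implementation: `ReplicaPaxosState`, `paxos_repair` and the integer ballots are invented for illustration, removing an entry stands in for "invalidate, complete, or witness something newer", and how the invalidation bound is propagated globally is not modelled.

```python
from dataclasses import dataclass, field

@dataclass
class ReplicaPaxosState:
    # Ballots witnessed as initiated but not yet seen to complete.
    incomplete: set = field(default_factory=set)
    # Promises with a ballot at or below this bound are treated as invalid.
    repaired_up_to: int = 0

    def witness_initiated(self, ballot):
        self.incomplete.add(ballot)

    def witness_completed(self, ballot):
        self.incomplete.discard(ballot)

    def promise_is_valid(self, ballot):
        return ballot > self.repaired_up_to

def paxos_repair(participants, repair_ballot, total_replicas):
    """Resolve everything the participants saw as incomplete before the
    repair started, then record the new lower bound on each of them."""
    if len(participants) < total_replicas // 2 + 1:
        raise ValueError("paxos repair needs a majority of replicas")
    for replica in participants:
        pending = {b for b in replica.incomplete if b < repair_ballot}
        replica.incomplete -= pending  # completed or invalidated
        replica.repaired_up_to = repair_ballot

r1, r2, r3 = ReplicaPaxosState(), ReplicaPaxosState(), ReplicaPaxosState()
r1.witness_initiated(5)
r1.witness_initiated(7)
r2.witness_initiated(7)
r2.witness_completed(7)
r3.witness_initiated(6)  # r3 does not take part in the repair

paxos_repair([r1, r2], repair_ballot=10, total_replicas=3)
print(r1.incomplete, r1.promise_is_valid(8), r1.promise_is_valid(11))
```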



Specific questions:

Assuming a table that only uses these LWTs:

* As the repair is only guaranteed for a majority of replicas, I assume I
can discover somewhere which replicas are up to date like this?

* Do I understand correctly, that if I take a backup from such a replica,
it is guaranteed to contain the full state up to a certain timestamp t?
(And in addition may or may not contain mutations higher than t, which of
course could overwrite the value the same key had at t.)

* Does the replica also end up with a complete and continuous log of all
writes until t? If not, does a merge of all logs in the majority contain a
complete log? In particular, I'm trying to parse the significance of "or
witnessing something newer"? (Use case for this last question could be
point in time restore, aka continuous backup, or also streaming writes to a
downstream system.)

henrik
--

Henrik Ingo

+358 40 569 7354