+1 (nb)

________________________________
From: Ariel Weisberg <ar...@weisberg.ws>
Sent: Tuesday, May 6, 2025 12:59:09 PM
To: Claude Warren, Jr <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-46 Finish Transient Replication/Witnesses

Hi,

On Sun, May 4, 2025, at 4:57 PM, Jordan West wrote:
I’m generally supportive. The concept is one that I can see the benefits of and 
I also think the current implementation adds a lot of complexity to the 
codebase for being stuck in experimental mode. It will be great to have a more 
robust version built on a better approach.

One of the great things about this is that it actually deletes and simplifies 
implementation code, if you ignore, of course, the hat trick of mutation 
tracking making log-only replication possible.

So far it's been mostly deleted and changed lines to get the single partition 
read, range read, and write paths working. A lot of the code already exists 
for transient replication, so it's changed rather than new code. PaxosV2 and 
Accord will both need to become witness aware, and that will be new code, but 
it's relatively straightforward in that it's just picking full replicas for 
reads (see the sketch below).
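
To make "picking full replicas for reads" concrete, here is a minimal sketch 
in Java. The names are hypothetical stand-ins, not Cassandra's actual 
read-plan code, though the real org.apache.cassandra.locator.Replica type does 
track full vs. transient status:

    import java.util.List;
    import java.util.stream.Collectors;

    // Stand-in for illustration; Cassandra's real replica types live in
    // org.apache.cassandra.locator.
    record Replica(String endpoint, boolean isFull) {}

    final class WitnessAwareReads
    {
        // Witnesses may hold unreconciled writes only in their mutation log,
        // so data reads must be served by full replicas.
        static List<Replica> fullReplicasForRead(List<Replica> natural)
        {
            return natural.stream()
                          .filter(Replica::isFull)
                          .collect(Collectors.toList());
        }

        public static void main(String[] args)
        {
            List<Replica> replicas = List.of(
                new Replica("10.0.0.1", true),   // full replica
                new Replica("10.0.0.2", true),   // full replica
                new Replica("10.0.0.3", false)); // witness for this range
            System.out.println(fullReplicasForRead(replicas));
        }
    }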

On Mon, May 5, 2025, at 1:21 PM, Nate McCall wrote:
I'd like to see a note on the CEP about documentation overhead as this is an 
important feature to communicate correctly, but that's just a nit. +1 on moving 
forward with this overall.

There is documentation for transient replication 
https://cassandra.apache.org/doc/4.0/cassandra/new/transientreplication.html 
which needs to be promoted out of "What's new", updated, and linked to the 
documentation for mutation tracking. I'll update the CEP to cover this.

On Mon, May 5, 2025, at 1:49 PM, Jon Haddad wrote:
It took me a bit to wrap my head around how this works, but now that I think I 
understand the idea, it sounds like a solid improvement.  Being able to achieve 
the same results as quorum but costing 1/3 less is a *big deal* and I know 
several teams that would be interested.

1/3rd is the "free" threshold where you don't increase your probability of 
experiencing data loss using quorums for common topologies. If you have a lot 
of replicas, because say you want copies in many places, you might be able to 
reduce further. The size of the quorum that votes on a value is basically 
decoupled from how redundantly that value is stored long term.
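
As a back-of-the-envelope illustration of that decoupling (example numbers 
mine, not from the CEP):

    // Toy arithmetic: the voting quorum is a majority of all voters
    // (witnesses included), while long-term storage cost depends only on
    // the number of full replicas.
    final class QuorumVsStorage
    {
        static void show(String label, int voters, int fullReplicas)
        {
            int quorum = voters / 2 + 1;
            System.out.printf("%s: quorum %d of %d voters, %d durable copies%n",
                              label, quorum, voters, fullReplicas);
        }

        public static void main(String[] args)
        {
            show("RF=3, no witnesses", 3, 3); // quorum 2 of 3, 3 copies
            show("RF=3, 1 witness   ", 3, 2); // quorum 2 of 3, 2 copies: the "free" 1/3
            show("RF=5, 2 witnesses ", 5, 3); // more voters allow deeper reduction
        }
    }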

One thing I'm curious about (and we can break it out into a separate 
discussion), is how all the functionality that requires coordination and global 
state (repaired vs non-repaired) will affect backups.  Without a 
synchronization primitive to take a cluster-wide snapshot, how can we safely 
restore from eventually consistent backups without risking consistency issues 
due to out-of-sync repaired status?

Witnesses don't make the consistency of backups better or worse, but they do 
add a little bit of complexity if your backups copy only the repaired data.

The procedure you follow today, where you copy the repaired sstables for a 
range from a single replica and copy the unrepaired sstables from a quorum, 
would continue to apply. The added constraint with witnesses is that the 
single replica you pick to copy repaired sstables from needs to be a full 
replica, not a witness, for that range.
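
A sketch of what that added constraint might look like in backup tooling 
(hypothetical names, not a real Cassandra API):

    import java.util.List;

    // Hypothetical model of a backup planner honoring the witness constraint:
    // repaired sstables can be copied from a single replica, but that replica
    // must be a full replica, not a witness, for the range being backed up.
    record RangeReplica(String endpoint, boolean isFullForRange) {}

    final class BackupPlanner
    {
        static String repairedSourceFor(List<RangeReplica> replicas)
        {
            return replicas.stream()
                           .filter(RangeReplica::isFullForRange)
                           .findFirst() // any full replica will do
                           .orElseThrow(() -> new IllegalStateException(
                               "no full replica available for range"))
                           .endpoint();
        }
        // Unrepaired sstables are still copied from a quorum of replicas,
        // exactly as in the pre-witness procedure described above.
    }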

I don't think we have a way to get a consistent snapshot right now? There 
isn't even a "run repair and repair will create a consistent snapshot for you 
to copy as a backup" option. And then, as Benedict points out, LWT (with async 
commit) and Accord (which also defaults to async commit and has multi-key 
transactions that can be torn) both don't make for consistent backups.

We definitely need to follow up with leveraging new replication/transactions 
schemes to produce more consistent backups.

Ariel


On Sun, May 4, 2025 at 00:27 Benedict <bened...@apache.org> wrote:
+1

This is an obviously good feature for operators that are storage-bound in 
multi-DC deployments but want to retain their latency characteristics during 
node maintenance. Log replicas are the right approach.

> On 3 May 2025, at 23:42, sc...@paradoxica.net wrote:
>
> Hey everybody, bumping this CEP from Ariel in case you'd like some weekend 
> reading.
>
> We’d like to finish witnesses and bring them out of “experimental” status now 
> that Transactional Metadata and Mutation Tracking provide the building blocks 
> needed to complete them.
>
> Witnesses are part of a family of approaches in replicated storage systems to 
> maintain or boost availability and durability while reducing storage costs. 
> Log replicas are a close relative. Both are used by leading cloud databases – 
> for instance, Spanner implements witness replicas [1] while DynamoDB 
> implements log replicas [2].
>
> Witness replicas are a great fit for topologies that replicate at greater 
> than RF=3, most commonly multi-DC/multi-region deployments. Today in 
> Cassandra, all members of a voting quorum replicate all data forever. Witness 
> replicas let users break this coupling. They allow one to define voting 
> quorums that are larger than the number of copies of data that are stored in 
> perpetuity.
>
> Take a 3× DC cluster replicated at RF=3 in each DC as an example. In this 
> topology, Cassandra stores 9× copies of the database forever, a huge storage 
> amplification. Witnesses allow users to maintain a voting quorum of 9 members 
> (3× per DC) but reduce the durable replicas to 2× per DC, e.g., two durable 
> replicas and one witness. This maintains the availability properties of an 
> RF=3×3 topology while reducing storage costs by 33%, going from 9× copies to 
> 6×.
>
> The role of a witness is to "witness" a write and persist it until it has 
> been reconciled among all durable replicas; and to respond to read requests 
> for witnessed writes awaiting reconciliation. Note that witnesses don't 
> introduce a dedicated role for a node – whether a node is a durable replica 
> or witness for a token just depends on its position in the ring.
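>
> To illustrate "depends on its position in the ring", here's a toy model (not 
> Cassandra's actual placement code): given the natural endpoints for a token 
> in ring order and the number of witnesses declared, the trailing endpoints 
> take the witness role.
>
>     import java.util.LinkedHashMap;
>     import java.util.List;
>     import java.util.Map;
>
>     // Toy placement: the same node can be a full replica for one token and
>     // a witness for another, so there is no dedicated witness role.
>     final class RingRoles
>     {
>         static Map<String, String> rolesFor(List<String> ringOrder, int witnesses)
>         {
>             int full = ringOrder.size() - witnesses;
>             Map<String, String> roles = new LinkedHashMap<>();
>             for (int i = 0; i < ringOrder.size(); i++)
>                 roles.put(ringOrder.get(i), i < full ? "full" : "witness");
>             return roles;
>         }
>
>         public static void main(String[] args)
>         {
>             System.out.println(rolesFor(List.of("n1", "n2", "n3"), 1));
>             // {n1=full, n2=full, n3=witness}
>         }
>     }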
>
> This CEP builds on CEP-45: Mutation Tracking to establish the safety property 
> of the witness: guaranteeing that writes have been persisted to all durable 
> replicas before becoming purgeable. CEP-45's journal and reconciliation 
> design provide a great mechanism to ensure this while avoiding the write 
> amplification of incremental repair and anticompaction.
>
> Take a look at the CEP if you're interested - happy to answer questions and 
> discuss further: 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>
> – Scott
>
> [1] https://cloud.google.com/spanner/docs/replication
> [2] https://www.usenix.org/system/files/atc22-elhemali.pdf
>
>> On Apr 25, 2025, at 8:21 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>> Hi all,
>>
>> The CEP is available here: 
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=353601959
>>
>> We would like to propose CEP-46: Finish Transient Replication/Witnesses for 
>> adoption by the community. CEP-46 would rename transient replication to 
>> witnesses and leverage CEP-45 mutation tracking to implement witnesses as 
>> log replicas, replacing the incremental repair based witnesses.
>>
>> For those not familiar with transient replication: the keyspace replication 
>> settings declare some replicas as transient, e.g. 'DC1': '3/1' for three 
>> replicas of which one is transient, and when incremental repair runs the 
>> transient replicas delete data instead of moving it into the repaired set.
>>
>> With log replicas, nodes only materialize mutations in their local LSM for 
>> ranges where they are full replicas and not witnesses. For witness ranges a 
>> node will write mutations to its local mutation tracking log and participate 
>> in background and read time reconciliation. This saves the compaction 
>> overhead of IR based witnesses, which have to materialize and compact all 
>> mutations, even those applied to witness ranges.
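>>
>> A rough sketch of that write-path split (hypothetical names; CEP-45's 
>> actual journal and reconciliation machinery is much more involved):
>>
>>     // Every replica journals the mutation for reconciliation, but only
>>     // full replicas also apply it to the local LSM (memtable/sstables).
>>     // Skipping the apply on witness ranges is what avoids the compaction
>>     // overhead described above.
>>     interface MutationLog { void append(byte[] mutation); }
>>     interface LocalStore  { void apply(byte[] mutation); }
>>
>>     final class LogReplicaWritePath
>>     {
>>         private final MutationLog log;
>>         private final LocalStore store;
>>
>>         LogReplicaWritePath(MutationLog log, LocalStore store)
>>         {
>>             this.log = log;
>>             this.store = store;
>>         }
>>
>>         void receive(byte[] mutation, boolean fullReplicaForRange)
>>         {
>>             log.append(mutation);      // always journaled
>>             if (fullReplicaForRange)
>>                 store.apply(mutation); // materialized only on full replicas
>>         }
>>     }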
>>
>> This would address one of the biggest issues with witnesses, which is the 
>> lack of monotonic reads. Implementation complexity wise, this would actually 
>> delete code compared to what would be required to complete IR based 
>> witnesses, because most of the heavy lifting is already done by mutation 
>> tracking.
>>
>> Log replicas also make it much more practical to realize the cost savings 
>> of witnesses, because log replicas have easier-to-characterize resource 
>> consumption requirements (write rate * recovery/reconfiguration time; for 
>> example, 50 MB/s of writes times a one-hour recovery window is on the order 
>> of 180 GB of log) and target a 10x improvement in write throughput. This 
>> makes it safer and easier to know how much capacity can be omitted.
>>
>> Thanks,
>> Ariel
>
