+1 (nb)

> On May 6, 2025, at 1:19 PM, Josh McKenzie <jmcken...@apache.org> wrote:
> 
> +1
> 
> On Tue, May 6, 2025, at 4:06 PM, Yifan Cai wrote:
>> +1 (nb)
>> 
>> 
>> 
>> From: Ariel Weisberg <ar...@weisberg.ws>
>> Sent: Tuesday, May 6, 2025 12:59:09 PM
>> To: Claude Warren, Jr <dev@cassandra.apache.org>
>> Subject: Re: [DISCUSS] CEP-46 Finish Transient Replication/Witnesses
>>  
>> Hi,
>> 
>> On Sun, May 4, 2025, at 4:57 PM, Jordan West wrote:
>>> I’m generally supportive. The concept is one I can see the benefits of, 
>>> and I also think the current implementation adds a lot of complexity to 
>>> the codebase for something stuck in experimental mode. It will be great 
>>> to have a more robust version built on a better approach. 
>> 
>> 
>> One of the great things about this is that it actually deletes and 
>> simplifies implementation code, if you set aside the hat trick of mutation 
>> tracking making log-only replication possible in the first place, of course.
>> 
>> So far it's been mostly deleted and changed lines to get the single 
>> partition read, range read, and write paths working. A lot of the code 
>> already exists for transient replication, so it's changed rather than new 
>> code. PaxosV2 and Accord will both need to become witness-aware, and that 
>> will be new code, but it's relatively straightforward in that it's just 
>> picking full replicas for reads.
>> 
>> On Mon, May 5, 2025, at 1:21 PM, Nate McCall wrote:
>>> I'd like to see a note on the CEP about documentation overhead as this is 
>>> an important feature to communicate correctly, but that's just a nit. +1 on 
>>> moving forward with this overall. 
>> There is documentation for transient replication at 
>> https://cassandra.apache.org/doc/4.0/cassandra/new/transientreplication.html 
>> which needs to be promoted out of "What's new", updated, and linked to the 
>> documentation for mutation tracking. I'll update the CEP to cover this.
>> 
>> 
>> On Mon, May 5, 2025, at 1:49 PM, Jon Haddad wrote:
>>> It took me a bit to wrap my head around how this works, but now that I 
>>> think I understand the idea, it sounds like a solid improvement.  Being 
>>> able to achieve the same results as quorum but costing 1/3 less is a *big 
>>> deal* and I know several teams that would be interested.
>> 1/3rd is the "free" threshold where you don't increase your probability of 
>> experiencing data loss using quorums for common topologies. If you have a 
>> lot of replicas, because say you want copies in many places, you might be 
>> able to reduce further. How many replicas vote on a value is basically 
>> decoupled from how redundantly that value is stored long term.
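>> 
>> A rough back-of-the-envelope sketch of that decoupling (plain Python, 
>> illustrative numbers only, not from the CEP):
>> 
>>     def quorum(n):
>>         # majority quorum of an n-member voting group
>>         return n // 2 + 1
>> 
>>     voters = 9                   # e.g. 3 DCs x RF=3 voting replicas
>>     witnesses = voters // 3      # the ~1/3 "free" threshold discussed above
>>     durable = voters - witnesses # copies stored long term
>>     assert durable >= quorum(voters)   # durable copies alone still form a quorum
>>     print(voters, witnesses, durable)  # 9 3 6
>> 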
>>> One thing I'm curious about (and we can break it out into a separate 
>>> discussion) is how all the functionality that requires coordination and 
>>> global state (repaired vs non-repaired) will affect backups.  Without a 
>>> synchronization primitive to take a cluster-wide snapshot, how can we 
>>> safely restore from eventually consistent backups without risking 
>>> consistency issues due to out-of-sync repaired status?
>> Witnesses don't make the consistency of backups better or worse, but they 
>> do add a little bit of complexity if your backups copy only the repaired 
>> data.
>> 
>> The procedure you follow today, where you copy the repaired SSTables for a 
>> range from a single replica and copy the unrepaired SSTables from a quorum, 
>> would continue to apply. The added constraint with witnesses is that the 
>> single replica you pick to copy repaired SSTables from needs to be a full 
>> replica, not a witness, for that range.
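>> 
>> As a sketch of that source-selection rule (hypothetical names, not 
>> Cassandra's actual API):
>> 
>>     # Pick backup sources for one token range:
>>     #  - repaired SSTables: any single full replica (never a witness)
>>     #  - unrepaired SSTables: a quorum of the voting replicas, as today
>>     def backup_sources(full_replicas, witnesses):
>>         voters = full_replicas + witnesses
>>         quorum = len(voters) // 2 + 1
>>         repaired_source = full_replicas[0]       # must be a full replica
>>         unrepaired_sources = voters[:quorum]
>>         return repaired_source, unrepaired_sources
>> 
>>     print(backup_sources(["n1", "n2"], ["n3"]))  # ('n1', ['n1', 'n2'])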
>> 
>> I don't think we have a way to get a consistent snapshot right now? There 
>> isn't even a "run repair and repair will create a consistent snapshot for 
>> you to copy as a backup" option. And, as Benedict points out, LWT (with 
>> async commit) and Accord (which also defaults to async commit and has 
>> multi-key transactions that can be torn) don't make for consistent backups 
>> either.
>> 
>> We definitely need to follow up on leveraging the new replication and 
>> transaction schemes to produce more consistent backups.
>> 
>> Ariel
>>> 
>>> On Sun, May 4, 2025 at 00:27 Benedict <bened...@apache.org 
>>> <mailto:bened...@apache.org>> wrote:
>>> +1
>>> 
>>> This is an obviously good feature for operators that are storage-bound in 
>>> multi-DC deployments but want to retain their latency characteristics 
>>> during node maintenance. Log replicas are the right approach.
>>> 
>>> > On 3 May 2025, at 23:42, sc...@paradoxica.net 
>>> > <mailto:sc...@paradoxica.net> wrote:
>>> >
>>> > Hey everybody, bumping this CEP from Ariel in case you'd like some 
>>> > weekend reading.
>>> >
>>> > We’d like to finish witnesses and bring them out of “experimental” status 
>>> > now that Transactional Metadata and Mutation Tracking provide the 
>>> > building blocks needed to complete them.
>>> >
>>> > Witnesses are part of a family of approaches in replicated storage 
>>> > systems to maintain or boost availability and durability while reducing 
>>> > storage costs. Log replicas are a close relative. Both are used by 
>>> > leading cloud databases – for instance, Spanner implements witness 
>>> > replicas [1] while DynamoDB implements log replicas [2].
>>> >
>>> > Witness replicas are a great fit for topologies that replicate at greater 
>>> > than RF=3 – most commonly multi-DC/multi-region deployments. Today in 
>>> > Cassandra, all members of a voting quorum replicate all data forever. 
>>> > Witness replicas let users break this coupling. They allow one to define 
>>> > voting quorums that are larger than the number of copies of data that are 
>>> > stored in perpetuity.
>>> >
>>> > Take a 3× DC cluster replicated at RF=3 in each DC as an example. In this 
>>> > topology, Cassandra stores 9× copies of the database forever - huge 
>>> > storage amplification. Witnesses allow users to maintain a voting quorum 
>>> > of 9 members (3× per DC) but reduce the durable replicas to 2× per DC – 
>>> > e.g., two durable replicas and one witness. This maintains the 
>>> > availability properties of an RF=3×3 topology while reducing storage 
>>> > costs by 33%, going from 9× copies to 6×.
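>>> >
>>> > A quick arithmetic check of that example (plain Python, illustrative):
>>> >
>>> >     dcs, rf = 3, 3
>>> >     voters = dcs * rf                        # 9 voting replicas in total
>>> >     durable_per_dc, witnesses_per_dc = 2, 1  # 2 durable + 1 witness per DC
>>> >     durable = dcs * durable_per_dc           # 6 full copies kept forever
>>> >     savings = 1 - durable / voters
>>> >     print(voters, durable, f"{savings:.0%}")  # 9 6 33%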
>>> >
>>> > The role of a witness is to "witness" a write and persist it until it has 
>>> > been reconciled among all durable replicas; and to respond to read 
>>> > requests for witnessed writes awaiting reconciliation. Note that 
>>> > witnesses don't introduce a dedicated role for a node – whether a node is 
>>> > a durable replica or witness for a token just depends on its position in 
>>> > the ring.
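>>> >
>>> > A tiny sketch of that retention rule (made-up names, just to make the 
>>> > condition concrete):
>>> >
>>> >     # A witnessed write becomes purgeable only after every durable
>>> >     # replica for its range has reconciled it.
>>> >     def purgeable(write_id, durable_replicas, reconciled_by):
>>> >         return durable_replicas <= reconciled_by.get(write_id, set())
>>> >
>>> >     print(purgeable("w1", {"n1", "n2"}, {"w1": {"n1"}}))        # False
>>> >     print(purgeable("w1", {"n1", "n2"}, {"w1": {"n1", "n2"}}))  # True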
>>> >
>>> > This CEP builds on CEP-45: Mutation Tracking to establish the safety 
>>> > property of the witness: guaranteeing that writes have been persisted to 
>>> > all durable replicas before becoming purgeable. CEP-45's journal and 
>>> > reconciliation design provide a great mechanism to ensure this while 
>>> > avoiding the write amplification of incremental repair and anticompaction.
>>> >
>>> > Take a look at the CEP if you're interested - happy to answer questions 
>>> > and discuss further: 
>>> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>>> >
>>> > – Scott
>>> >
>>> > [1] https://cloud.google.com/spanner/docs/replication
>>> > [2] https://www.usenix.org/system/files/atc22-elhemali.pdf
>>> >
>>> >> On Apr 25, 2025, at 8:21 AM, Ariel Weisberg <ar...@weisberg.ws 
>>> >> <mailto:ar...@weisberg.ws>> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> The CEP is available here: 
>>> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=353601959
>>> >>
>>> >> We would like to propose CEP-46: Finish Transient Replication/Witnesses 
>>> >> for adoption by the community. CEP-46 would rename transient replication 
>>> >> to witnesses and leverage mutation tracking to implement witnesses as 
>>> >> CEP-45 Mutation Tracking based Log Replicas as a replacement for 
>>> >> incremental repair based witnesses.
>>> >>
>>> >> For those not familiar with transient replication: the keyspace 
>>> >> replication settings declare some replicas as transient, and when 
>>> >> incremental repair runs the transient replicas delete data instead of 
>>> >> moving it into the repaired set.
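>>> >>
>>> >> For reference, the existing experimental syntax expresses this per DC as 
>>> >> "<total>/<transient>" in the keyspace replication options; a minimal 
>>> >> sketch using the Python driver (contact point and keyspace name are made 
>>> >> up, and CEP-46 may of course change this surface):
>>> >>
>>> >>     from cassandra.cluster import Cluster
>>> >>
>>> >>     session = Cluster(["127.0.0.1"]).connect()
>>> >>     # 3 replicas per DC, 1 of which is transient (a witness under CEP-46)
>>> >>     session.execute("""
>>> >>         ALTER KEYSPACE demo_ks WITH replication = {
>>> >>             'class': 'NetworkTopologyStrategy', 'DC1': '3/1', 'DC2': '3/1'
>>> >>         }
>>> >>     """)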
>>> >>
>>> >> With log replicas, nodes only materialize mutations in their local LSM 
>>> >> for ranges where they are full replicas and not witnesses. For witness 
>>> >> ranges a node writes mutations to its local mutation tracking log and 
>>> >> participates in background and read-time reconciliation. This saves the 
>>> >> compaction overhead of IR-based witnesses, which have to materialize and 
>>> >> compact all mutations, even those applied to witness ranges.
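>>> >>
>>> >> A minimal sketch of that write-path split (hypothetical names, not the 
>>> >> actual implementation):
>>> >>
>>> >>     from dataclasses import dataclass, field
>>> >>
>>> >>     @dataclass
>>> >>     class Replica:
>>> >>         full_ranges: set                           # ranges fully replicated here
>>> >>         log: list = field(default_factory=list)    # CEP-45 mutation tracking log
>>> >>         table: dict = field(default_factory=dict)  # stand-in for the local LSM
>>> >>
>>> >>         def apply(self, token_range, key, value):
>>> >>             self.log.append((token_range, key, value))  # journaled for all ranges
>>> >>             if token_range in self.full_ranges:         # witness ranges stop here
>>> >>                 self.table[key] = value                 # only full ranges materialize
>>> >>
>>> >>     node = Replica(full_ranges={"A"})
>>> >>     node.apply("B", "k1", "v1")  # range B: witness, log only, nothing to compact
>>> >>     node.apply("A", "k2", "v2")  # range A: full replica, log + local table
>>> >>     print(len(node.log), node.table)  # 2 {'k2': 'v2'}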
>>> >>
>>> >> This would address one of the biggest issues with witnesses, which is 
>>> >> the lack of monotonic reads. Implementation-complexity-wise this would 
>>> >> actually delete code compared to what would be required to complete 
>>> >> IR-based witnesses, because most of the heavy lifting is already done by 
>>> >> mutation tracking.
>>> >>
>>> >> Log replicas also make it much more practical to realize the cost 
>>> >> savings of witnesses, because log replicas have easier-to-characterize 
>>> >> resource consumption requirements (write rate * recovery/reconfiguration 
>>> >> time) and target a 10x improvement in write throughput. This makes 
>>> >> knowing how much capacity can be omitted safer and easier.
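>>> >>
>>> >> As a back-of-the-envelope sketch of that sizing (all numbers made up):
>>> >>
>>> >>     write_rate_mb_per_s = 50    # sustained write rate into witness ranges
>>> >>     recovery_time_s = 6 * 3600  # worst-case recovery/reconfiguration window
>>> >>     log_capacity_gb = write_rate_mb_per_s * recovery_time_s / 1024
>>> >>     print(f"~{log_capacity_gb:.0f} GB of log capacity per witness")  # ~1055 GB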
>>> >>
>>> >> Thanks,
>>> >> Ariel
>>> >
>> 
