+1 (nb)

________________________________
From: Ariel Weisberg <ar...@weisberg.ws>
Sent: Tuesday, May 6, 2025 12:59:09 PM
To: Claude Warren, Jr <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-46 Finish Transient Replication/Witnesses
Hi,

On Sun, May 4, 2025, at 4:57 PM, Jordan West wrote:
> I'm generally supportive. The concept is one that I can see the benefits of, and I also think the current implementation adds a lot of complexity to the codebase for being stuck in experimental mode. It will be great to have a more robust version built on a better approach.

One of the great things about this is that it actually deletes and simplifies implementation code, if you ignore the hat trick of mutation tracking making log-only replication possible in the first place. So far it's been mostly deleted and changed lines to get the single-partition read, range read, and write paths working. A lot of the code already exists for transient replication, so it's changed rather than new code. PaxosV2 and Accord will both need to become witness-aware, and that will be new code, but it's relatively straightforward in that it's just picking full replicas for reads.

On Mon, May 5, 2025, at 1:21 PM, Nate McCall wrote:
> I'd like to see a note on the CEP about documentation overhead as this is an important feature to communicate correctly, but that's just a nit. +1 on moving forward with this overall.

There is documentation for transient replication at https://cassandra.apache.org/doc/4.0/cassandra/new/transientreplication.html which needs to be promoted out of "What's new", updated, and linked to the documentation for mutation tracking. I'll update the CEP to cover this.

On Mon, May 5, 2025, at 1:49 PM, Jon Haddad wrote:
> It took me a bit to wrap my head around how this works, but now that I think I understand the idea, it sounds like a solid improvement. Being able to achieve the same results as quorum but costing 1/3 less is a *big deal* and I know several teams that would be interested.

1/3rd is the "free" threshold where you don't increase your probability of experiencing data loss when using quorums for common topologies. If you have a lot of replicas, because say you want copies in many places, you might be able to reduce storage further. Voting on what the value is, is basically decoupled from how redundantly that value is stored long term.

> One thing I'm curious about (and we can break it out into a separate discussion) is how all the functionality that requires coordination and global state (repaired vs non-repaired) will affect backups. Without a synchronization primitive to take a cluster-wide snapshot, how can we safely restore from eventually consistent backups without risking consistency issues due to out-of-sync repaired status?

Witnesses don't make the consistency of backups better or worse, but they do add a little complexity if your backups copy only the repaired data. The procedure you follow today, where you copy the repaired sstables for a range from a single replica and copy the unrepaired sstables from a quorum, would continue to apply. The added constraint with witnesses is that the single replica you pick to copy repaired sstables from needs to be a full replica, not a witness, for that range.

I don't think we have a way to get a consistent snapshot right now? There isn't even "run repair and repair will create a consistent snapshot for you to copy as a backup". And as Benedict points out, LWT (with async commit) and Accord (which also defaults to async commit and has multi-key transactions that can be torn) both don't make for consistent backups. We definitely need to follow up on leveraging the new replication/transaction schemes to produce more consistent backups.
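To make the repaired-backup constraint concrete, here is a minimal Java sketch. Everything in it (the Replica record, BackupSourcePlanner, the quorum-picking helper) is hypothetical and only illustrates the selection rule, not Cassandra's actual backup tooling:

    import java.util.List;

    final class BackupSourcePlanner {
        // Hypothetical replica descriptor: a host plus whether it is a
        // full replica (rather than a witness) for the range in question.
        record Replica(String host, boolean fullReplica) {}

        // Repaired sstables for a range are copied from exactly one source,
        // and with witnesses that source must be a full replica.
        static Replica repairedSource(List<Replica> replicasForRange) {
            return replicasForRange.stream()
                    .filter(Replica::fullReplica)
                    .findFirst()
                    .orElseThrow(() -> new IllegalStateException(
                            "no full replica available for repaired sstables"));
        }

        // Unrepaired sstables are still copied from a quorum of all voting
        // replicas, witnesses included, exactly as today. Taking the first
        // majority is arbitrary; any quorum works.
        static List<Replica> unrepairedSources(List<Replica> replicasForRange) {
            int quorum = replicasForRange.size() / 2 + 1;
            return replicasForRange.subList(0, quorum);
        }
    }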
Ariel

On Sun, May 4, 2025 at 00:27 Benedict <bened...@apache.org> wrote:

+1

This is an obviously good feature for operators that are storage-bound in multi-DC deployments but want to retain their latency characteristics during node maintenance. Log replicas are the right approach.

> On 3 May 2025, at 23:42, sc...@paradoxica.net wrote:
>
> Hey everybody, bumping this CEP from Ariel in case you'd like some weekend reading.
>
> We'd like to finish witnesses and bring them out of "experimental" status now that Transactional Metadata and Mutation Tracking provide the building blocks needed to complete them.
>
> Witnesses are part of a family of approaches in replicated storage systems to maintain or boost availability and durability while reducing storage costs. Log replicas are a close relative. Both are used by leading cloud databases – for instance, Spanner implements witness replicas [1] while DynamoDB implements log replicas [2].
>
> Witness replicas are a great fit for topologies that replicate at greater than RF=3 – most commonly multi-DC/multi-region deployments. Today in Cassandra, all members of a voting quorum replicate all data forever. Witness replicas let users break this coupling. They allow one to define voting quorums that are larger than the number of copies of data that are stored in perpetuity.
>
> Take a 3× DC cluster replicated at RF=3 in each DC as an example. In this topology, Cassandra stores 9× copies of the database forever - huge storage amplification. Witnesses allow users to maintain a voting quorum of 9 members (3× per DC) but reduce the durable replicas to 2× per DC – e.g., two durable replicas and one witness. This maintains the availability properties of an RF=3×3 topology while reducing storage costs by 33%, going from 9× copies to 6×.
>
> The role of a witness is to "witness" a write and persist it until it has been reconciled among all durable replicas, and to respond to read requests for witnessed writes awaiting reconciliation. Note that witnesses don't introduce a dedicated role for a node – whether a node is a durable replica or witness for a token just depends on its position in the ring.
>
> This CEP builds on CEP-45: Mutation Tracking to establish the safety property of the witness: guaranteeing that writes have been persisted to all durable replicas before becoming purgeable. CEP-45's journal and reconciliation design provide a great mechanism to ensure this while avoiding the write amplification of incremental repair and anticompaction.
>
> Take a look at the CEP if you're interested - happy to answer questions and discuss further:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>
> – Scott
>
> [1] https://cloud.google.com/spanner/docs/replication
> [2] https://www.usenix.org/system/files/atc22-elhemali.pdf
>
>> On Apr 25, 2025, at 8:21 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>
>> Hi all,
>>
>> The CEP is available here:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=353601959
>>
>> We would like to propose CEP-46: Finish Transient Replication/Witnesses for adoption by the community. CEP-46 would rename transient replication to witnesses and leverage mutation tracking to implement witnesses as CEP-45 Mutation Tracking based log replicas, as a replacement for incremental repair based witnesses.
>>
>> For those not familiar with transient replication: the keyspace replication settings declare some replicas as transient, and when incremental repair runs, the transient replicas delete data instead of moving it into the repaired set.
>>
>> With log replicas, nodes only materialize mutations in their local LSM for ranges where they are full replicas and not witnesses. For witness ranges, a node will write mutations to its local mutation tracking log and participate in background and read-time reconciliation. This saves the compaction overhead of IR-based witnesses, which have to materialize and perform compaction on all mutations, even those being applied to witness ranges.
>>
>> This would address one of the biggest issues with witnesses, which is the lack of monotonic reads. Implementation-complexity wise, this would actually delete code compared to what would be required to complete IR-based witnesses, because most of the heavy lifting is already done by mutation tracking.
>>
>> Log replicas also make it much more practical to realize the cost savings of witnesses, because log replicas have easier-to-characterize resource consumption requirements (write rate * recovery/reconfiguration time) and target a 10x improvement in write throughput. This makes knowing how much capacity can be omitted safer and easier.
>>
>> Thanks,
>> Ariel
>
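To illustrate the log-replica write path described in the quoted proposal, here is a minimal sketch. Every type in it (MutationLog, Memtable, RangeOwnership) is a hypothetical stand-in for the CEP-45 journal, the local LSM, and cluster metadata respectively, not Cassandra's actual internals:

    // Hypothetical stand-ins, not Cassandra internals.
    record Mutation(long token, byte[] payload) {}

    interface MutationLog { void append(Mutation m); }                 // CEP-45 style journal
    interface Memtable { void apply(Mutation m); }                     // local LSM entry point
    interface RangeOwnership { boolean isFullReplicaFor(long token); } // from cluster metadata

    final class LogReplicaWritePath {
        private final MutationLog log;
        private final Memtable memtable;
        private final RangeOwnership ownership;

        LogReplicaWritePath(MutationLog log, Memtable memtable, RangeOwnership ownership) {
            this.log = log;
            this.memtable = memtable;
            this.ownership = ownership;
        }

        void write(Mutation m) {
            // Every replica, full or witness, journals the mutation.
            log.append(m);
            // Only full replicas materialize it in the LSM. Witness ranges skip
            // this step, avoiding memtable and compaction cost; the journaled
            // entry can be purged once reconciliation confirms all full
            // replicas have the write.
            if (ownership.isFullReplicaFor(m.token())) {
                memtable.apply(m);
            }
        }
    }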
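And to put numbers on the 3-DC example from the quoted thread, the storage arithmetic works out as follows (purely illustrative):

    public class WitnessStorageMath {
        public static void main(String[] args) {
            int dcs = 3, rfPerDc = 3;                  // 3 DCs at RF=3 each
            int votingReplicas = dcs * rfPerDc;        // 9 voting quorum members
            int witnessesPerDc = 1;                    // one witness per DC per range
            int durableCopies = dcs * (rfPerDc - witnessesPerDc); // 6 durable copies
            double savings = 1.0 - (double) durableCopies / votingReplicas;
            // Prints: voting=9 durable=6 savings=33%
            System.out.printf("voting=%d durable=%d savings=%.0f%%%n",
                    votingReplicas, durableCopies, savings * 100);
        }
    }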