Hey everybody, bumping this CEP from Ariel in case you'd like some weekend reading.
We’d like to finish witnesses and bring them out of “experimental” status now that Transactional Metadata and Mutation Tracking provide the building blocks needed to complete them.

Witnesses are part of a family of approaches in replicated storage systems to maintain or boost availability and durability while reducing storage costs. Log replicas are a close relative. Both are used by leading cloud databases – for instance, Spanner implements witness replicas [1] while DynamoDB implements log replicas [2].

Witness replicas are a great fit for topologies that replicate at greater than RF=3 – most commonly multi-DC/multi-region deployments. Today in Cassandra, all members of a voting quorum replicate all data forever. Witness replicas let users break this coupling: they allow voting quorums that are larger than the number of copies of data stored in perpetuity.

Take a 3-DC cluster replicated at RF=3 in each DC as an example. In this topology, Cassandra stores 9 copies of the database forever – huge storage amplification. Witnesses allow users to maintain a voting quorum of 9 members (3 per DC) while reducing the durable replicas to 2 per DC – i.e., two durable replicas and one witness per DC. This maintains the availability properties of an RF=3×3 topology while reducing storage costs by 33%, going from 9 copies to 6.

The role of a witness is to "witness" a write and persist it until it has been reconciled among all durable replicas, and to respond to read requests for witnessed writes awaiting reconciliation. Note that witnesses don't introduce a dedicated node role – whether a node is a durable replica or a witness for a token depends only on its position in the ring.

This CEP builds on CEP-45: Mutation Tracking to establish the safety property of the witness: guaranteeing that writes have been persisted to all durable replicas before becoming purgeable. CEP-45's journal and reconciliation design provide a great mechanism to ensure this while avoiding the write amplification of incremental repair and anticompaction.

Take a look at the CEP if you're interested – happy to answer questions and discuss further:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking

– Scott

[1] https://cloud.google.com/spanner/docs/replication
[2] https://www.usenix.org/system/files/atc22-elhemali.pdf

> On Apr 25, 2025, at 8:21 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
> Hi all,
>
> The CEP is available here:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=353601959
>
> We would like to propose CEP-46: Finish Transient Replication/Witnesses for
> adoption by the community. CEP-46 would rename transient replication to
> witnesses and leverage mutation tracking to implement witnesses as CEP-45
> mutation tracking based log replicas, replacing incremental repair (IR)
> based witnesses.
>
> For those not familiar with transient replication: the keyspace replication
> settings declare some replicas as transient, and when incremental repair
> runs, the transient replicas delete data instead of moving it into the
> repaired set.
>
> With log replicas, nodes only materialize mutations in their local LSM for
> ranges where they are full replicas and not witnesses. For witness ranges,
> a node will write mutations to its local mutation tracking log and
> participate in background and read-time reconciliation.
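To make the write path just described concrete, here's a rough, self-contained sketch. Every name in it (MutationJournal, Placement, and so on) is invented for illustration and does not correspond to actual Cassandra internals; it only shows the shape of the full-replica vs. witness branch:

    import java.util.ArrayList;
    import java.util.List;

    class WitnessWritePathSketch
    {
        record Mutation(long token, String payload) {}

        /** Stand-in for the CEP-45 mutation tracking journal. */
        static class MutationJournal
        {
            final List<Mutation> entries = new ArrayList<>();
            void append(Mutation m) { entries.add(m); }
        }

        /** Stand-in for the local LSM (memtables/sstables). */
        static class LocalStore
        {
            final List<Mutation> materialized = new ArrayList<>();
            void apply(Mutation m) { materialized.add(m); }
        }

        /** In reality this is determined by the node's position in the ring. */
        interface Placement
        {
            boolean isFullReplica(long token);
        }

        final MutationJournal journal = new MutationJournal();
        final LocalStore store = new LocalStore();
        final Placement placement;

        WitnessWritePathSketch(Placement placement) { this.placement = placement; }

        void applyMutation(Mutation m)
        {
            // Assumption per CEP-45: every replica, full or witness, appends to
            // its local journal so the write can be served to reads and
            // reconciled later.
            journal.append(m);

            // Only full replicas materialize the write into the local LSM;
            // witness ranges skip this, and the journal entry becomes purgeable
            // once reconciliation confirms all full replicas have the write.
            if (placement.isFullReplica(m.token()))
                store.apply(m);
        }

        public static void main(String[] args)
        {
            // Toy placement: even tokens are full-replica ranges, odd are witnessed.
            WitnessWritePathSketch node = new WitnessWritePathSketch(t -> t % 2 == 0);
            node.applyMutation(new Mutation(2, "journaled and materialized"));
            node.applyMutation(new Mutation(3, "journaled only (witness range)"));
            System.out.println("journal=" + node.journal.entries.size()
                             + " materialized=" + node.store.materialized.size());
        }
    }

Witnessed writes pay only the journal append; the LSM write and everything downstream of it are skipped: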
> This saves the compaction overhead of IR-based witnesses, which have to
> materialize and compact all mutations, even those applied to witness ranges.
>
> This would also address one of the biggest issues with witnesses: the lack
> of monotonic reads. In terms of implementation complexity, this would
> actually delete code compared to what would be required to complete IR-based
> witnesses, because most of the heavy lifting is already done by mutation
> tracking.
>
> Log replicas also make it much more practical to realize the cost savings of
> witnesses, because log replicas have easier-to-characterize resource
> consumption requirements (write rate * recovery/reconfiguration time) and
> target a 10x improvement in write throughput. This makes it safer and easier
> to know how much capacity can be omitted.
>
> Thanks,
> Ariel
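To put rough numbers on that last point: a witness's steady-state footprint for a range is approximately the write rate for that range multiplied by how long writes sit in the journal before reconciliation makes them purgeable. A back-of-the-envelope sketch, with entirely made-up numbers:

    class WitnessSizingSketch
    {
        public static void main(String[] args)
        {
            // Illustrative figures only – not recommendations or CEP targets.
            double writeBytesPerSec = 50.0 * 1024 * 1024;  // 50 MiB/s of witnessed writes
            double recoveryWindowSec = 4 * 3600;           // recovery/reconfiguration window: 4 hours
            double journalBytes = writeBytesPerSec * recoveryWindowSec;
            System.out.printf("Witness journal high-water mark: ~%.0f GiB%n",
                              journalBytes / (1024.0 * 1024 * 1024));  // ~703 GiB
        }
    }

A full replica, by contrast, retains every byte indefinitely – which is the sense in which log replicas make it safer and easier to know how much capacity can be omitted.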