I’m generally supportive. I can see the benefits of the concept, and I also think the current implementation adds a lot of complexity to the codebase for a feature that is stuck in experimental mode. It will be great to have a more robust version built on a better approach.
On Sun, May 4, 2025 at 00:27 Benedict <bened...@apache.org> wrote:

> +1
>
> This is an obviously good feature for operators that are storage-bound in
> multi-DC deployments but want to retain their latency characteristics
> during node maintenance. Log replicas are the right approach.
>
>
> On 3 May 2025, at 23:42, sc...@paradoxica.net wrote:
> >
> > Hey everybody, bumping this CEP from Ariel in case you'd like some
> > weekend reading.
> >
> > We’d like to finish witnesses and bring them out of “experimental”
> > status now that Transactional Metadata and Mutation Tracking provide the
> > building blocks needed to complete them.
> >
> > Witnesses are part of a family of approaches in replicated storage
> > systems to maintain or boost availability and durability while reducing
> > storage costs. Log replicas are a close relative. Both are used by leading
> > cloud databases – for instance, Spanner implements witness replicas [1]
> > while DynamoDB implements log replicas [2].
> >
> > Witness replicas are a great fit for topologies that replicate at
> > greater than RF=3 –– most commonly multi-DC/multi-region deployments. Today
> > in Cassandra, all members of a voting quorum replicate all data forever.
> > Witness replicas let users break this coupling. They allow one to define
> > voting quorums that are larger than the number of copies of data that are
> > stored in perpetuity.
> >
> > Take a 3× DC cluster replicated at RF=3 in each DC as an example. In
> > this topology, Cassandra stores 9× copies of the database forever - huge
> > storage amplification. Witnesses allow users to maintain a voting quorum of
> > 9 members (3× per DC); but reduce the durable replicas to 2× per DC – e.g.,
> > two durable replicas and one witness. This maintains the availability
> > properties of an RF=3×3 topology while reducing storage costs by 33%, going
> > from 9× copies to 6×.
> >
> > The role of a witness is to "witness" a write and persist it until it
> > has been reconciled among all durable replicas; and to respond to read
> > requests for witnessed writes awaiting reconciliation. Note that witnesses
> > don't introduce a dedicated role for a node – whether a node is a durable
> > replica or witness for a token just depends on its position in the ring.
> >
> > This CEP builds on CEP-45: Mutation Tracking to establish the safety
> > property of the witness: guaranteeing that writes have been persisted to
> > all durable replicas before becoming purgeable. CEP-45's journal and
> > reconciliation design provide a great mechanism to ensure this while
> > avoiding the write amplification of incremental repair and anticompaction.
> >
> > Take a look at the CEP if you're interested - happy to answer questions
> > and discuss further:
> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
> >
> > – Scott
> >
> > [1] https://cloud.google.com/spanner/docs/replication
> > [2] https://www.usenix.org/system/files/atc22-elhemali.pdf
> >
> >> On Apr 25, 2025, at 8:21 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
> >>
> >> Hi all,
> >>
> >> The CEP is available here:
> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=353601959
> >>
> >> We would like to propose CEP-46: Finish Transient Replication/Witnesses
> >> for adoption by the community. CEP-46 would rename transient replication to
> >> witnesses and leverage mutation tracking to implement witnesses as CEP-45
> >> Mutation Tracking based Log Replicas as a replacement for incremental
> >> repair based witnesses.
> >>
> >> For those not familiar with transient replication it would have the
> >> keyspace replication settings declare some replicas as transient and when
> >> incremental repair runs the transient replicas would delete data instead of
> >> moving it into the repaired set.
> >>
> >> With log replicas nodes only materialize mutations in their local LSM
> >> for ranges where they are full replicas and not witnesses. For witness
> >> ranges a node will write mutations to their local mutation tracking log and
> >> participate in background and read time reconciliation. This saves the
> >> compaction overhead of IR based witnesses which have to materialize and
> >> perform compaction on all mutations even those being applied to witness
> >> ranges.
> >>
> >> This would address one of the biggest issues with witnesses which is
> >> the lack of monotonic reads. Implementation complexity wise this would
> >> actually delete code compared to what would be required to complete IR
> >> based witnesses because most of the heavy lifting is already done by
> >> mutation tracking.
> >>
> >> Log replicas also makes it much more practical to realize the cost
> >> savings of witnesses because log replicas have easier to characterize
> >> resource consumption requirements (write rate * recovery/reconfiguration
> >> time) and target a 10x improvement in write throughput. This makes knowing
> >> how much capacity can be omitted safer and easier.
> >>
> >> Thanks,
> >> Ariel
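
P.S. For anyone reading along who hasn't used the existing experimental feature: today's transient replication is declared per keyspace with the 'total/transient' replication factor notation, so the RF=3×3 example above (two durable replicas plus one witness per DC) would look roughly like the sketch below. This is only an illustration of the current syntax – the final witness syntax under CEP-46 may well differ, the keyspace and DC names are placeholders, and transient replication has to be enabled in cassandra.yaml before the statement is accepted.

    -- Illustration only: current experimental transient replication syntax.
    -- '3/1' means 3 voting replicas per DC, 1 of which is transient
    -- (a "witness" in CEP-46 terms), leaving 2 durable replicas per DC.
    CREATE KEYSPACE demo_ks
      WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'DC1': '3/1',
        'DC2': '3/1',
        'DC3': '3/1'
      };

And as a back-of-envelope illustration of the sizing formula Ariel mentions (write rate * recovery/reconfiguration time): a witness range absorbing 10 MB/s of writes with a 4-hour reconciliation window would need on the order of 10 MB/s × 14,400 s ≈ 144 GB of log space – made-up numbers, purely to show how the estimate works.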