+1
On Tue, May 6, 2025, at 4:06 PM, Yifan Cai wrote:

> +1 (nb)
>
> *From:* Ariel Weisberg <ar...@weisberg.ws>
> *Sent:* Tuesday, May 6, 2025 12:59:09 PM
> *To:* Claude Warren, Jr <dev@cassandra.apache.org>
> *Subject:* Re: [DISCUSS] CEP-46 Finish Transient Replication/Witnesses
>
> Hi,
>
> On Sun, May 4, 2025, at 4:57 PM, Jordan West wrote:
>> I'm generally supportive. The concept is one that I can see the benefits of, and I also think the current implementation adds a lot of complexity to the codebase for being stuck in experimental mode. It will be great to have a more robust version built on a better approach.
>
> One of the great things about this is that it actually deletes and simplifies implementation code, if you ignore the hat trick of mutation tracking making log-only replication possible in the first place.
>
> So far it's been mostly deleted and changed lines to get the single partition read, range read, and write paths working. A lot of the code already exists for transient replication, so it's changed rather than new code. PaxosV2 and Accord will both need to become witness aware, and that will be new code, but it's relatively straightforward in that it's just picking full replicas for reads.
>
> On Mon, May 5, 2025, at 1:21 PM, Nate McCall wrote:
>> I'd like to see a note on the CEP about documentation overhead as this is an important feature to communicate correctly, but that's just a nit. +1 on moving forward with this overall.
>
> There is documentation for transient replication at https://cassandra.apache.org/doc/4.0/cassandra/new/transientreplication.html which needs to be promoted out of "What's new", updated, and linked to the documentation for mutation tracking. I'll update the CEP to cover this.
>
> On Mon, May 5, 2025, at 1:49 PM, Jon Haddad wrote:
>> It took me a bit to wrap my head around how this works, but now that I think I understand the idea, it sounds like a solid improvement. Being able to achieve the same results as quorum but costing 1/3 less is a *big deal* and I know several teams that would be interested.
>
> 1/3rd is the "free" threshold where you don't increase your probability of experiencing data loss using quorums, for common topologies. If you have a lot of replicas, because say you want copies in many places, you might be able to reduce further. Voting on what the value is becomes basically decoupled from how redundantly that value is stored long term.
>
>> One thing I'm curious about (and we can break it out into a separate discussion) is how all the functionality that requires coordination and global state (repaired vs non-repaired) will affect backups. Without a synchronization primitive to take a cluster-wide snapshot, how can we safely restore from eventually consistent backups without risking consistency issues due to out-of-sync repaired status?
>
> Witnesses don't make the consistency of backups better or worse, but they do add a little bit of complexity if your backups copy only the repaired data.
>
> The procedure you follow today, where you copy the repaired sstables for a range from a single replica and the unrepaired sstables from a quorum, would continue to apply. The added constraint with witnesses is that the single replica you pick to copy repaired sstables from needs to be a full replica, not a witness, for that range.
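A minimal sketch of that backup-source constraint; the Replica class and pick_backup_sources helper below are invented for illustration only, not Cassandra's actual API:

    # Illustrative only: Replica and pick_backup_sources are made-up names.
    # The point is the added constraint from witnesses: the single node you
    # copy repaired sstables from must be a full replica for that range.
    from dataclasses import dataclass

    @dataclass
    class Replica:
        host: str
        is_full: bool  # False means the node is a witness for this range

    def pick_backup_sources(replicas, quorum_size):
        """Return (source for repaired sstables, sources for unrepaired sstables)."""
        full_replicas = [r for r in replicas if r.is_full]
        if not full_replicas:
            raise ValueError("no full replica available for this range")
        repaired_source = full_replicas[0]           # must not be a witness
        unrepaired_sources = replicas[:quorum_size]  # a quorum for the range
        return repaired_source, unrepaired_sources

    # Example: two durable replicas and one witness for a range, quorum of 2.
    nodes = [Replica("n1", True), Replica("n2", True), Replica("n3", False)]
    print(pick_backup_sources(nodes, quorum_size=2))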
> I don't think we have a way to get a consistent snapshot right now? Like, there isn't even "run repair and repair will create a consistent snapshot for you to copy as a backup". And, as Benedict points out, LWT (with async commit) and Accord (which also defaults to async commit and has multi-key transactions that can be torn) both don't make for consistent backups.
>
> We definitely need to follow up on leveraging the new replication/transaction schemes to produce more consistent backups.
>
> Ariel
>
>> On Sun, May 4, 2025 at 00:27 Benedict <bened...@apache.org> wrote:
>>> +1
>>>
>>> This is an obviously good feature for operators that are storage-bound in multi-DC deployments but want to retain their latency characteristics during node maintenance. Log replicas are the right approach.
>>>
>>> > On 3 May 2025, at 23:42, sc...@paradoxica.net wrote:
>>> >
>>> > Hey everybody, bumping this CEP from Ariel in case you'd like some weekend reading.
>>> >
>>> > We'd like to finish witnesses and bring them out of "experimental" status now that Transactional Metadata and Mutation Tracking provide the building blocks needed to complete them.
>>> >
>>> > Witnesses are part of a family of approaches in replicated storage systems to maintain or boost availability and durability while reducing storage costs. Log replicas are a close relative. Both are used by leading cloud databases – for instance, Spanner implements witness replicas [1] while DynamoDB implements log replicas [2].
>>> >
>>> > Witness replicas are a great fit for topologies that replicate at greater than RF=3, most commonly multi-DC/multi-region deployments. Today in Cassandra, all members of a voting quorum replicate all data forever. Witness replicas let users break this coupling. They allow one to define voting quorums that are larger than the number of copies of data that are stored in perpetuity.
>>> >
>>> > Take a 3× DC cluster replicated at RF=3 in each DC as an example. In this topology, Cassandra stores 9× copies of the database forever, which is huge storage amplification. Witnesses allow users to maintain a voting quorum of 9 members (3× per DC) but reduce the durable replicas to 2× per DC, e.g. two durable replicas and one witness in each DC. This maintains the availability properties of an RF=3×3 topology while reducing storage costs by 33%, going from 9× copies to 6×.
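As a quick check of the arithmetic in the 3-DC example above (a toy calculation, with the counts taken from that paragraph):

    # Durable copies stored in a 3-DC cluster, before and after witnesses.
    dcs = 3
    voting_members_per_dc = 3      # quorum membership is unchanged
    durable_replicas_per_dc = 2    # two durable replicas + one witness per DC

    copies_today = dcs * voting_members_per_dc              # 9 copies stored forever
    copies_with_witnesses = dcs * durable_replicas_per_dc   # 6 copies
    savings = 1 - copies_with_witnesses / copies_today
    print(copies_today, copies_with_witnesses, f"{savings:.0%}")  # 9 6 33%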
>>> > The role of a witness is to "witness" a write and persist it until it has been reconciled among all durable replicas, and to respond to read requests for witnessed writes awaiting reconciliation. Note that witnesses don't introduce a dedicated role for a node: whether a node is a durable replica or a witness for a token just depends on its position in the ring.
>>> >
>>> > This CEP builds on CEP-45: Mutation Tracking to establish the safety property of the witness: guaranteeing that writes have been persisted to all durable replicas before becoming purgeable. CEP-45's journal and reconciliation design provide a great mechanism to ensure this while avoiding the write amplification of incremental repair and anticompaction.
>>> >
>>> > Take a look at the CEP if you're interested - happy to answer questions and discuss further:
>>> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>>> >
>>> > – Scott
>>> >
>>> > [1] https://cloud.google.com/spanner/docs/replication
>>> > [2] https://www.usenix.org/system/files/atc22-elhemali.pdf
>>> >
>>> >> On Apr 25, 2025, at 8:21 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> The CEP is available here:
>>> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=353601959
>>> >>
>>> >> We would like to propose CEP-46: Finish Transient Replication/Witnesses for adoption by the community. CEP-46 would rename transient replication to witnesses and leverage mutation tracking to implement witnesses as CEP-45 mutation-tracking-based log replicas, replacing incremental repair based witnesses.
>>> >>
>>> >> For those not familiar with transient replication: the keyspace replication settings declare some replicas as transient, and when incremental repair runs, the transient replicas delete data instead of moving it into the repaired set.
>>> >>
>>> >> With log replicas, nodes only materialize mutations in their local LSM for ranges where they are full replicas and not witnesses. For witness ranges, a node writes mutations to its local mutation tracking log and participates in background and read-time reconciliation. This saves the compaction overhead of IR-based witnesses, which have to materialize and compact all mutations, even those applied to witness ranges.
>>> >>
>>> >> This would address one of the biggest issues with witnesses, which is the lack of monotonic reads. Implementation-complexity-wise this would actually delete code compared to what would be required to complete IR-based witnesses, because most of the heavy lifting is already done by mutation tracking.
>>> >>
>>> >> Log replicas also make it much more practical to realize the cost savings of witnesses, because log replicas have easier-to-characterize resource consumption requirements (write rate * recovery/reconfiguration time) and target a 10x improvement in write throughput. This makes knowing how much capacity can be omitted safer and easier.
>>> >>
>>> >> Thanks,
>>> >> Ariel
>>> >
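A back-of-the-envelope illustration of that sizing formula; the numbers below are invented purely to show the shape of the estimate:

    # Rough sizing sketch for a witness's mutation log (illustrative numbers only).
    # Per the formula above, the space a witness needs is roughly bounded by
    # write rate * the window during which writes can await reconciliation
    # (recovery/reconfiguration time), not by the size of the full dataset.
    write_rate_mb_per_s = 20             # assumed sustained writes into witness ranges
    reconciliation_window_s = 6 * 3600   # assume reconciliation can lag ~6 hours
    headroom = 2.0                       # safety factor for bursts and retention slack

    log_capacity_gb = write_rate_mb_per_s * reconciliation_window_s * headroom / 1024
    print(f"~{log_capacity_gb:.0f} GB of log capacity")  # ~844 GB in this example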