+1 (nb)

On Tue, 6 May 2025 at 17:32, Aleksey Yeshchenko <alek...@apple.com> wrote:
> +1
>
> On 5 May 2025, at 23:24, Blake Eggleston <bl...@ultrablake.com> wrote:
>
> As mutation tracking relates to existing backup systems that account for repaired vs unrepaired sstables: mutation tracking will continue to promote sstables to repaired once we know they contain data that has been fully reconciled. The main difference is that they won’t be promoted as part of an explicit range repair, but by compaction, as they become eligible for promotion.
>
> (also +1 to finishing witnesses)
>
> On Mon, May 5, 2025, at 11:45 AM, Benedict Elliott Smith wrote:
>
> Consistent backup/restore is a fundamentally hard and unsolved problem for Cassandra today (without any of the mentioned features). In particular, we break the real-time guarantee of the linearizability property (most notably for LWTs) between partitions for any backup/restore process today.
>
> Fixing this should be relatively straightforward for Accord, and is something we intend to address in follow-up work. Fixing it for eventually consistent (or Paxos/LWT) operations is, I think, achievable with or without mutation tracking (probably easier with mutation tracking). I’m not aware of any plans to try to tackle this, though.
>
> Witness replicas should not particularly matter to any of the above.
>
> On 5 May 2025, at 18:49, Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> It took me a bit to wrap my head around how this works, but now that I think I understand the idea, it sounds like a solid improvement. Being able to achieve the same results as quorum while costing 1/3 less is a *big deal*, and I know several teams that would be interested.
>
> One thing I'm curious about (and we can break it out into a separate discussion) is how all the functionality that requires coordination and global state (repaired vs non-repaired) will affect backups. Without a synchronization primitive to take a cluster-wide snapshot, how can we safely restore from eventually consistent backups without risking consistency issues due to out-of-sync repaired status?
>
> I don't think we need to block any of the proposed work on this - it's just something that's been nagging at me, and I don't know enough about the nuances of Accord, Mutation Tracking, or Witness Replicas to say whether it affects things or not. If it does, let's make sure we have it documented [1].
>
> Jon
>
> [1] https://cassandra.apache.org/doc/latest/cassandra/managing/operating/backups.html
>
> On Mon, May 5, 2025 at 10:21 AM Nate McCall <zznat...@gmail.com> wrote:
>
> This sounds like a modern feature that will benefit a lot of folks by cutting storage costs, particularly in large deployments.
>
> I'd like to see a note on the CEP about documentation overhead, as this is an important feature to communicate correctly, but that's just a nit. +1 on moving forward with this overall.
>
> On Sun, May 4, 2025 at 1:58 PM Jordan West <jw...@apache.org> wrote:
>
> I’m generally supportive. The concept is one I can see the benefits of, and I also think the current implementation adds a lot of complexity to the codebase for being stuck in experimental mode. It will be great to have a more robust version built on a better approach.
>
> On Sun, May 4, 2025 at 00:27 Benedict <bened...@apache.org> wrote:
>
> +1
>
> This is an obviously good feature for operators that are storage-bound in multi-DC deployments but want to retain their latency characteristics during node maintenance. Log replicas are the right approach.
>
> On 3 May 2025, at 23:42, sc...@paradoxica.net wrote:
>
> > Hey everybody, bumping this CEP from Ariel in case you'd like some weekend reading.
> >
> > We’d like to finish witnesses and bring them out of “experimental” status now that Transactional Metadata and Mutation Tracking provide the building blocks needed to complete them.
> >
> > Witnesses are part of a family of approaches in replicated storage systems for maintaining or boosting availability and durability while reducing storage costs. Log replicas are a close relative. Both are used by leading cloud databases – for instance, Spanner implements witness replicas [1], while DynamoDB implements log replicas [2].
> >
> > Witness replicas are a great fit for topologies that replicate at greater than RF=3 –– most commonly multi-DC/multi-region deployments. Today in Cassandra, all members of a voting quorum replicate all data forever. Witness replicas let users break this coupling: they allow one to define voting quorums that are larger than the number of copies of data stored in perpetuity.
> >
> > Take a 3× DC cluster replicated at RF=3 in each DC as an example. In this topology, Cassandra stores 9× copies of the database forever - huge storage amplification. Witnesses allow users to maintain a voting quorum of 9 members (3× per DC) while reducing the durable replicas to 2× per DC – e.g., two durable replicas and one witness. This maintains the availability properties of an RF=3×3 topology while reducing storage costs by 33%, going from 9× copies to 6×.
> >
> > The role of a witness is to "witness" a write and persist it until it has been reconciled among all durable replicas, and to respond to read requests for witnessed writes awaiting reconciliation. Note that witnesses don't introduce a dedicated role for a node – whether a node is a durable replica or a witness for a token just depends on its position in the ring.
> >
> > This CEP builds on CEP-45: Mutation Tracking to establish the safety property of the witness: guaranteeing that writes have been persisted to all durable replicas before becoming purgeable. CEP-45's journal and reconciliation design provides a great mechanism to ensure this while avoiding the write amplification of incremental repair and anticompaction.
> >
> > Take a look at the CEP if you're interested - happy to answer questions and discuss further: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
> >
> > – Scott
> >
> > [1] https://cloud.google.com/spanner/docs/replication
> > [2] https://www.usenix.org/system/files/atc22-elhemali.pdf
> >
> >> On Apr 25, 2025, at 8:21 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
> >>
> >> Hi all,
> >>
> >> The CEP is available here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=353601959
> >>
> >> We would like to propose CEP-46: Finish Transient Replication/Witnesses for adoption by the community. CEP-46 would rename transient replication to witnesses and leverage mutation tracking to implement witnesses as CEP-45 Mutation Tracking based log replicas, replacing incremental repair based witnesses.
> >>
> >> For those not familiar with transient replication: the keyspace replication settings declare some replicas as transient, and when incremental repair runs, the transient replicas delete data instead of moving it into the repaired set.
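[For reference, a minimal sketch of what these replication settings look like with today's experimental transient replication syntax, where each datacenter's setting is written as '<total_replicas>/<transient_replicas>'. The keyspace and DC names are illustrative, and CEP-46 may change this interface as part of the rename to witnesses:

  CREATE KEYSPACE demo_ks WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'DC1': '3/1',  -- 3 replicas in this DC, of which 1 is transient:
    'DC2': '3/1',  -- i.e. 2 durable replicas + 1 transient/witness
    'DC3': '3/1'
  };

This matches Scott's example above: a 9-member voting quorum with 6 durable copies. Note that transient replication must also be enabled in cassandra.yaml before a keyspace like this can be created.]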
> >>
> >> With log replicas, nodes only materialize mutations in their local LSM tree for ranges where they are full replicas and not witnesses. For witness ranges, a node writes mutations to its local mutation tracking log and participates in background and read-time reconciliation. This saves the compaction overhead of IR-based witnesses, which have to materialize and perform compaction on all mutations, even those applied to witness ranges.
> >>
> >> This would address one of the biggest issues with witnesses, which is the lack of monotonic reads. Implementation-complexity-wise, this would actually delete code compared to what would be required to complete IR-based witnesses, because most of the heavy lifting is already done by mutation tracking.
> >>
> >> Log replicas also make it much more practical to realize the cost savings of witnesses, because log replicas have easier-to-characterize resource consumption requirements (write rate × recovery/reconfiguration time) and target a 10× improvement in write throughput. This makes knowing how much capacity can be omitted safer and easier.
> >>
> >> Thanks,
> >> Ariel

--
Dmitry Konstantinov
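[As a rough, illustrative reading of Ariel's sizing formula above – the numbers here are hypothetical, not from the CEP. A witness's log capacity requirement is approximately:

  log capacity ≈ write rate × recovery/reconfiguration window

So a node absorbing, say, 20 MB/s of writes for its witness ranges, with a 4-hour (14,400 s) recovery window, would need on the order of 20 MB/s × 14,400 s ≈ 288 GB of log space – bounded and predictable, versus a full replica that must durably store its entire share of the dataset in perpetuity.]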