Regarding how mutation tracking relates to existing backup systems that account 
for repaired vs unrepaired sstables: mutation tracking will continue to promote 
sstables to repaired once we know they contain only data that has been fully 
reconciled. The main difference is that promotion won’t happen as part of an 
explicit range repair; compaction will promote sstables as they become eligible.

(also +1 to finishing witnesses)

On Mon, May 5, 2025, at 11:45 AM, Benedict Elliott Smith wrote:
> Consistent backup/restore is a fundamentally hard and unsolved problem for 
> Cassandra today (without any of the mentioned features). In particular, we 
> break the real-time guarantee of the linearizability property (most notably 
> for LWTs) between partitions for any backup/restore process today.
> 
> Fixing this should be relatively straightforward for Accord, and is something 
> we intend to address in follow-up work. Fixing it for eventually consistent 
> (or Paxos/LWT) operations is, I think, achievable, with or without mutation 
> tracking (probably easier with mutation tracking). I’m not aware of any plans 
> to tackle this, though.
> 
> Witness replicas should not particularly matter at all to any of the above.
> 
>> On 5 May 2025, at 18:49, Jon Haddad <j...@rustyrazorblade.com> wrote:
>> 
>> It took me a bit to wrap my head around how this works, but now that I think 
>> I understand the idea, it sounds like a solid improvement.  Being able to 
>> achieve the same results as quorum while costing 1/3 less is a *big deal*, and 
>> I know several teams that would be interested.
>> 
>> One thing I'm curious about (and we can break it out into a separate 
>> discussion) is how all the functionality that requires coordination and 
>> global state (repaired vs non-repaired) will affect backups.  Without a 
>> synchronization primitive to take a cluster-wide snapshot, how can we safely 
>> restore from eventually consistent backups without risking consistency 
>> issues due to out-of-sync repaired status?
>> 
>> I don't think we need to block any of the proposed work on this - it's just 
>> something that's been nagging at me, and I don't know enough about the 
>> nuances of Accord, Mutation Tracking, or Witness Replicas to say whether it 
>> affects things or not.  If it does, let's make sure we have that documented [1].
>> 
>> Jon
>> 
>> [1] 
>> https://cassandra.apache.org/doc/latest/cassandra/managing/operating/backups.html
>> 
>> 
>> 
>> On Mon, May 5, 2025 at 10:21 AM Nate McCall <zznat...@gmail.com> wrote:
>>> This sounds like a modern feature that will benefit a lot of folks in 
>>> cutting storage costs, particularly in large deployments.
>>> 
>>> I'd like to see a note on the CEP about documentation overhead as this is 
>>> an important feature to communicate correctly, but that's just a nit. +1 on 
>>> moving forward with this overall. 
>>> 
>>> On Sun, May 4, 2025 at 1:58 PM Jordan West <jw...@apache.org> wrote:
>>>> I’m generally supportive. The concept is one whose benefits I can see, 
>>>> and I also think the current implementation adds a lot of complexity to 
>>>> the codebase for something stuck in experimental mode. It will be great to 
>>>> have a more robust version built on a better approach. 
>>>> 
>>>> On Sun, May 4, 2025 at 00:27 Benedict <bened...@apache.org> wrote:
>>>>> +1
>>>>> 
>>>>> This is an obviously good feature for operators that are storage-bound in 
>>>>> multi-DC deployments but want to retain their latency characteristics 
>>>>> during node maintenance. Log replicas are the right approach.
>>>>> 
>>>>> > On 3 May 2025, at 23:42, sc...@paradoxica.net wrote:
>>>>> > 
>>>>> > Hey everybody, bumping this CEP from Ariel in case you'd like some 
>>>>> > weekend reading.
>>>>> > 
>>>>> > We’d like to finish witnesses and bring them out of “experimental” 
>>>>> > status now that Transactional Metadata and Mutation Tracking provide 
>>>>> > the building blocks needed to complete them.
>>>>> > 
>>>>> > Witnesses are part of a family of approaches in replicated storage 
>>>>> > systems to maintain or boost availability and durability while reducing 
>>>>> > storage costs. Log replicas are a close relative. Both are used by 
>>>>> > leading cloud databases – for instance, Spanner implements witness 
>>>>> > replicas [1] while DynamoDB implements log replicas [2].
>>>>> > 
>>>>> > Witness replicas are a great fit for topologies that replicate at 
>>>>> > greater than RF=3 – most commonly multi-DC/multi-region deployments. 
>>>>> > Today in Cassandra, all members of a voting quorum replicate all data 
>>>>> > forever. Witness replicas let users break this coupling. They allow one 
>>>>> > to define voting quorums that are larger than the number of copies of 
>>>>> > data that are stored in perpetuity.
>>>>> > 
>>>>> > Take a 3× DC cluster replicated at RF=3 in each DC as an example. In 
>>>>> > this topology, Cassandra stores 9× copies of the database forever - 
>>>>> > huge storage amplification. Witnesses allow users to maintain a voting 
>>>>> > quorum of 9 members (3× per DC) but reduce the durable replicas to 2× 
>>>>> > per DC – e.g., two durable replicas and one witness. This maintains the 
>>>>> > availability properties of an RF=3×3 topology while reducing storage 
>>>>> > costs by 33%, going from 9× copies to 6×.
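>>>>> > 
>>>>> > As a quick back-of-the-envelope sketch of that arithmetic (a standalone 
>>>>> > illustration, not anything from the CEP):
>>>>> > 
>>>>> >     // Copies stored forever, with and without witnesses.
>>>>> >     public final class WitnessStorageMath {
>>>>> >         static int copies(int dcs, int durablePerDc) {
>>>>> >             return dcs * durablePerDc;   // only durable replicas keep data in perpetuity
>>>>> >         }
>>>>> >         public static void main(String[] args) {
>>>>> >             int before = copies(3, 3);   // RF=3x3, all durable: 9 copies
>>>>> >             int after  = copies(3, 2);   // 2 durable + 1 witness per DC: 6 copies
>>>>> >             System.out.printf("%d -> %d copies, %.0f%% less storage%n",
>>>>> >                     before, after, 100.0 * (before - after) / before);  // 33% less
>>>>> >         }
>>>>> >     }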
>>>>> > 
>>>>> > The role of a witness is to "witness" a write and persist it until it 
>>>>> > has been reconciled among all durable replicas; and to respond to read 
>>>>> > requests for witnessed writes awaiting reconciliation. Note that 
>>>>> > witnesses don't introduce a dedicated role for a node – whether a node 
>>>>> > is a durable replica or witness for a token just depends on its 
>>>>> > position in the ring.
>>>>> > 
>>>>> > This CEP builds on CEP-45: Mutation Tracking to establish the safety 
>>>>> > property of the witness: guaranteeing that writes have been persisted 
>>>>> > to all durable replicas before becoming purgeable. CEP-45's journal and 
>>>>> > reconciliation design provide a great mechanism to ensure this while 
>>>>> > avoiding the write amplification of incremental repair and 
>>>>> > anticompaction.
>>>>> > 
>>>>> > Take a look at the CEP if you're interested - happy to answer questions 
>>>>> > and discuss further: 
>>>>> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>>>>> > 
>>>>> > – Scott
>>>>> > 
>>>>> > [1] https://cloud.google.com/spanner/docs/replication
>>>>> > [2] https://www.usenix.org/system/files/atc22-elhemali.pdf
>>>>> > 
>>>>> >> On Apr 25, 2025, at 8:21 AM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>> >> 
>>>>> >> Hi all,
>>>>> >> 
>>>>> >> The CEP is available here: 
>>>>> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=353601959
>>>>> >> 
>>>>> >> We would like to propose CEP-46: Finish Transient 
>>>>> >> Replication/Witnesses for adoption by the community. CEP-46 would 
>>>>> >> rename transient replication to witnesses and leverage CEP-45 Mutation 
>>>>> >> Tracking to implement witnesses as log replicas, replacing incremental 
>>>>> >> repair based witnesses.
>>>>> >> 
>>>>> >> For those not familiar with transient replication: the keyspace 
>>>>> >> replication settings declare some replicas as transient, and when 
>>>>> >> incremental repair runs, those transient replicas delete their data 
>>>>> >> instead of moving it into the repaired set.
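>>>>> >> 
>>>>> >> To make that concrete, a keyspace with transient replicas is declared 
>>>>> >> today using the experimental <total>/<transient> shorthand from 4.0 
>>>>> >> (whether CEP-46 keeps this exact syntax for witnesses is a separate 
>>>>> >> question); a rough sketch via the Java driver:
>>>>> >> 
>>>>> >>     // Sketch only: '3/1' means 3 replicas total, 1 of them transient.
>>>>> >>     // CEP-46 may expose witnesses through different syntax.
>>>>> >>     import com.datastax.oss.driver.api.core.CqlSession;
>>>>> >> 
>>>>> >>     public final class TransientKeyspaceExample {
>>>>> >>         public static void main(String[] args) {
>>>>> >>             try (CqlSession session = CqlSession.builder().build()) {
>>>>> >>                 session.execute(
>>>>> >>                     "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
>>>>> >>                   + "{'class': 'NetworkTopologyStrategy', 'DC1': '3/1'}");
>>>>> >>             }
>>>>> >>         }
>>>>> >>     }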
>>>>> >> 
>>>>> >> With log replicas, nodes only materialize mutations in their local LSM 
>>>>> >> for ranges where they are full replicas and not witnesses. For witness 
>>>>> >> ranges, a node writes mutations to its local mutation tracking log and 
>>>>> >> participates in background and read-time reconciliation. This saves the 
>>>>> >> compaction overhead of IR-based witnesses, which have to materialize 
>>>>> >> and compact all mutations, even those applied to witness ranges.
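>>>>> >> 
>>>>> >> A minimal sketch of that per-range split (every name below is 
>>>>> >> hypothetical and purely illustrative, not the internals proposed by 
>>>>> >> the CEP):
>>>>> >> 
>>>>> >>     // Stand-ins only; none of these types are real Cassandra internals.
>>>>> >>     interface Mutation {}
>>>>> >>     interface Memtable { void write(Mutation m); }             // local LSM entry point
>>>>> >>     interface MutationTrackingLog { void append(Mutation m); } // CEP-45-style journal
>>>>> >> 
>>>>> >>     enum ReplicaRole { FULL, WITNESS }
>>>>> >> 
>>>>> >>     final class LogReplicaWritePath {
>>>>> >>         private final Memtable memtable;
>>>>> >>         private final MutationTrackingLog log;
>>>>> >> 
>>>>> >>         LogReplicaWritePath(Memtable memtable, MutationTrackingLog log) {
>>>>> >>             this.memtable = memtable;
>>>>> >>             this.log = log;
>>>>> >>         }
>>>>> >> 
>>>>> >>         void apply(Mutation mutation, ReplicaRole roleForRange) {
>>>>> >>             if (roleForRange == ReplicaRole.FULL) {
>>>>> >>                 // Full replicas materialize the write in the local LSM, as today.
>>>>> >>                 memtable.write(mutation);
>>>>> >>             } else {
>>>>> >>                 // Witness ranges: record the mutation in the local tracking log
>>>>> >>                 // only, serve/reconcile it from there, and purge it once every
>>>>> >>                 // durable replica is known to have it.
>>>>> >>                 log.append(mutation);
>>>>> >>             }
>>>>> >>         }
>>>>> >>     }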
>>>>> >> 
>>>>> >> This would address one of the biggest issues with witnesses, which is 
>>>>> >> the lack of monotonic reads. In terms of implementation complexity, 
>>>>> >> this would actually delete code compared to what would be required to 
>>>>> >> complete IR-based witnesses, because most of the heavy lifting is 
>>>>> >> already done by mutation tracking.
>>>>> >> 
>>>>> >> Log replicas also make it much more practical to realize the cost 
>>>>> >> savings of witnesses, because log replicas have easier-to-characterize 
>>>>> >> resource consumption requirements (write rate * 
>>>>> >> recovery/reconfiguration time) and target a 10x improvement in write 
>>>>> >> throughput. This makes it safer and easier to know how much capacity 
>>>>> >> can be omitted.
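>>>>> >> 
>>>>> >> As a rough illustration of that sizing formula (the numbers below are 
>>>>> >> invented, not from the CEP):
>>>>> >> 
>>>>> >>     // Back-of-the-envelope: log capacity ~ write rate * the longest window a
>>>>> >>     // mutation may wait for reconciliation (recovery/reconfiguration time).
>>>>> >>     public final class WitnessLogSizing {
>>>>> >>         public static void main(String[] args) {
>>>>> >>             double writeRateMbPerSec = 10.0;         // made-up sustained write rate
>>>>> >>             double reconciliationWindowSec = 3600.0; // assume up to 1h to recover
>>>>> >>             double logCapacityGb = writeRateMbPerSec * reconciliationWindowSec / 1024.0;
>>>>> >>             System.out.printf("Provision roughly %.0f GB of witness log space%n",
>>>>> >>                     logCapacityGb);                  // ~35 GB for these inputs
>>>>> >>         }
>>>>> >>     }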
>>>>> >> 
>>>>> >> Thanks,
>>>>> >> Ariel
>>>>> >
