Re: [DISCUSS] CEP-45: Mutation Tracking

Blake Eggleston Thu, 09 Jan 2025 10:16:50 -0800

Hi Guo and Chris

>> Does it support changing the table to support mutation tracking through 
>> ALTER TABLE if it does not support mutation tracking before?


Yes, migration for existing keyspaces/tables will be supported.

>> Do you think that keyspace_inherit (or another keyword that clearly explains 
>> the behavior) is a better name than keyspace?


No, I think keyspace_inherit is too verbose, and keyspace is probably clear 
enough.

>> In addition, is legacy appropriate? Because this is a new feature, there is 
>> only the behavior of turning it on and off. 

I don't think legacy is a bad name; it's the older, legacy replication 
mechanism. That said, I don't have a strong attachment to legacy if you have 
suggestions for something better.
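
To make the inheritance behavior under discussion concrete, here is a minimal 
sketch in Python. It is not Cassandra code: the enum and function names are 
hypothetical, and it assumes the keyspace-level setting is either legacy or 
logged, which this thread does not spell out. Only the three table options 
(keyspace, legacy, logged) and the keyspace default come from the CEP text 
quoted further down.

```python
from enum import Enum


class KeyspaceReplication(Enum):
    # Assumption: a keyspace is either on the existing mechanism or tracked.
    LEGACY = "legacy"
    LOGGED = "logged"


class TableReplication(Enum):
    KEYSPACE = "keyspace"  # default: inherit whatever the keyspace uses
    LEGACY = "legacy"
    LOGGED = "logged"


def effective_replication(
    keyspace_setting: KeyspaceReplication,
    table_setting: TableReplication = TableReplication.KEYSPACE,
) -> KeyspaceReplication:
    """Resolve the replication type a table actually ends up with."""
    if table_setting is TableReplication.KEYSPACE:
        return keyspace_setting
    # An explicit table setting overrides the keyspace setting.
    return KeyspaceReplication(table_setting.value)
```

Under these assumptions, a table left at the default follows its keyspace, 
while a table explicitly set to legacy stays on the existing replication path 
even when its keyspace is logged.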

>> Is this something we can disable? I can see scenarios where this would be 
>> strictly and severely worse than existing scenarios

Sure, like the CEP says, this isn't meant to replace the existing replication 
system, but act as an alternative. If you don't configure a keyspace/table to 
use mutation tracking, then nothing changes.

>> short time window data, millions of writes a second that get thrown out 
>> after a few hours. If that data is small partitions we are nearly doubling 
>> the disk use for things we don't care about.


Right, this is a query pattern you wouldn't want to use mutation tracking for.


> On Jan 9, 2025, at 7:07 AM, Chris Lohfink <[email protected]> wrote:
> 
> Is this something we can disable? I can see scenarios where this would be 
> strictly and severely worse than existing scenarios where we don't need 
> repairs, i.e. short time window data, millions of writes a second that get 
> thrown out after a few hours. If that data is small partitions we are nearly 
> doubling the disk use for things we don't care about.
> 
> Chris
> 
> On Wed, Jan 8, 2025 at 9:01 PM guo Maxwell <[email protected] 
> <mailto:[email protected]>> wrote:
>> After a brief read-through, I have two questions. If I ask something 
>> inappropriate, please feel free to correct me:
>> 
>> 1. Does it support changing the table to support mutation tracking through 
>> ALTER TABLE if it does not support mutation tracking before?
>> 2.
>>> Available options for tables are keyspace, legacy, and logged, with the 
>>> default being keyspace, which inherits the keyspace setting
>>  
>> Do you think that keyspace_inherit (or another keyword that clearly explains 
>> the behavior) is a better name than keyspace?
>> In addition, is legacy appropriate? Because this is a new feature, there is 
>> only the behavior of turning it on and off. Turning it off means not using 
>> this feature. 
>> If the keyword legacy is used, from the user's perspective, is it using an 
>> old version of the mutation tracking? Similar to the relationship between 
>> SAI and native2i.
>> 
>> Jon Haddad <[email protected] <mailto:[email protected]>> 
>> wrote on Thu, Jan 9, 2025 at 06:14:
>>> JD, the fact that pagination is implemented as multiple queries is a design 
>>> choice.  A user performs a query with fetch size 1 or 100 and they will get 
>>> different behavior. 
>>> 
>>> I'm not asking for anyone to implement MVCC.  I'm asking for the docs 
>>> around this to be correct.  We should not use the term guarantee here, it's 
>>> best effort.
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Jan 8, 2025 at 2:06 PM J. D. Jordan <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>>> Your pagination case is not a violation of any guarantees Cassandra makes. 
>>>> It has never made guarantees across multiple queries.
>>>> Trying to have MVCC/consistent data across multiple queries is a very 
>>>> different issue/problem from this CEP.  If you want to have a discussion 
>>>> about MVCC I suggest creating a new thread.
>>>> 
>>>> -Jeremiah
>>>> 
>>>>> On Jan 8, 2025, at 3:47 PM, Jon Haddad <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> 
>>>>> > It's true that we can't offer multi-page write atomicity without some 
>>>>> > sort of MVCC. There are a lot of common query patterns that don't 
>>>>> > involve paging though, so it's not like the benefit of fixing write 
>>>>> > atomicity would only apply to a small subset of carefully crafted 
>>>>> > queries or something.
>>>>> 
>>>>> Sure, it'll work a lot, but we don't say "partition level write atomicity 
>>>>> some of the time".  We say guarantee.  From the CEP:
>>>>> 
>>>>> > In the case of read repair, since we are only reading and correcting 
>>>>> > the parts of a partition that we're reading and not the entire contents 
>>>>> > of a partition on each read, read repair can break our guarantee on 
>>>>> > partition level write atomicity. This approach also prevents meeting 
>>>>> > the monotonic read requirement for witness replicas, which has 
>>>>> > significantly limited its usefulness.
>>>>> 
>>>>> I point this out because it's not well known, and we make a guarantee 
>>>>> that isn't true, and while the CEP will reduce the number of cases in 
>>>>> which we violate the guarantee, we will still have known edge cases where 
>>>>> it doesn't hold up.  So we should stop saying it. 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Jan 8, 2025 at 1:30 PM Blake Eggleston <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>>> Thanks Dimitry and Jon, answers below
>>>>>> 
>>>>>>> 1) Is a single separate commit log expected to be created for all 
>>>>>>> tables with the new replication type?
>>>>>> 
>>>>>> The plan is to still have a single commit log, but only index mutations 
>>>>>> with a mutation id. 
>>>>>> 
>>>>>>> 2) What is the granularity of storing mutation ids in the memtable? Is 
>>>>>>> it per cell?
>>>>>> 
>>>>>> It would be per-partition
>>>>>> 
>>>>>>> 3) If we update the same row multiple times while it is in a memtable - 
>>>>>>> are all mutation ids appended to a kind of collection?
>>>>>> 
>>>>>> They would, yes. We might be able to do something where we stop tracking 
>>>>>> mutations that have been superseded by newer mutations (same cells, 
>>>>>> higher timestamps), but I suspect that would be more trouble than it's 
>>>>>> worth and would be out of scope for v1.
>>>>>> 
>>>>>>> 4) What is the expected size of a single id?
>>>>>> 
>>>>>> It's currently 12 bytes: a 4-byte node id (from TCM) and an 8-byte HLC.
>>>>>> 
>>>>>>> 5) Do we plan to support multi-table batches (single or 
>>>>>>> multi-partition) for this replication type?
>>>>>> 
>>>>>> 
>>>>>> This is intended to support all existing features, however the tracking 
>>>>>> only happens at the mutation level, so the different mutations coming 
>>>>>> out of a multi-partition batch would all be tracked individually
>>>>>> 
>>>>>>> So even without repair mucking things up, we're unable to fulfill this 
>>>>>>> promise except under the specific, ideal circumstance of querying a 
>>>>>>> partition with only 1 page.
>>>>>> 
>>>>>> 
>>>>>> It's true that we can't offer multi-page write atomicity without some 
>>>>>> sort of MVCC. There are a lot of common query patterns that don't 
>>>>>> involve paging though, so it's not like the benefit of fixing write 
>>>>>> atomicity would only apply to a small subset of carefully crafted 
>>>>>> queries or something.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Blake
>>>>>> 
>>>>>>> On Jan 8, 2025, at 12:23 PM, Jon Haddad <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> 
>>>>>>> Very cool!  I'll need to spend some time reading this over.  One thing 
>>>>>>> I did notice is this:
>>>>>>> 
>>>>>>> > Cassandra promises partition level write atomicity. This means that, 
>>>>>>> > although writes are eventually consistent, a given write will either 
>>>>>>> > be visible or not visible. You're not supposed to see a partially 
>>>>>>> > applied write. However, read repair and short read protection can 
>>>>>>> > both "tear" mutations. In the case of read repair, this is because 
>>>>>>> > the data resolver only evaluates the data included in the client 
>>>>>>> > read. So if your read only covers a portion of a write that didn't 
>>>>>>> > reach a quorum, only that portion will be repaired, breaking write 
>>>>>>> > atomicity.
>>>>>>> 
>>>>>>> Unfortunately there are more issues with this than just repair.  Since we 
>>>>>>> lack a consistency mechanism like MVCC while paginating, it's possible 
>>>>>>> to do the following:
>>>>>>> 
>>>>>>> thread A: reads a partition P with 10K rows, starts by reading the 
>>>>>>> first page
>>>>>>> thread B: another thread writes a batch to 2 rows in partition P, one 
>>>>>>> on page 1, another on page 2
>>>>>>> thread A: reads the second page of P which has the mutation.
>>>>>>> 
>>>>>>> I've worked with users who have been surprised by this behavior, 
>>>>>>> because pagination happens transparently.
>>>>>>> 
>>>>>>> So even without repair mucking things up, we're unable to fulfill this 
>>>>>>> promise except under the specific, ideal circumstance of querying a 
>>>>>>> partition with only 1 page.
>>>>>>> 
>>>>>>> Jon
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>> Hello dev@,
>>>>>>>> 
>>>>>>>> We'd like to propose CEP-45: Mutation Tracking for adoption by the 
>>>>>>>> community. CEP-45 proposes adding a replication mechanism to track and 
>>>>>>>> reconcile individual mutations, as well as processes to actively 
>>>>>>>> reconcile missing mutations.
>>>>>>>> 
>>>>>>>> For keyspaces with mutation tracking enabled, the immediate benefits 
>>>>>>>> of this CEP are:
>>>>>>>> * reduced replication lag with a continuous background reconciliation 
>>>>>>>> process
>>>>>>>> * eliminate the disk load caused by repair merkle tree calculation
>>>>>>>> * eliminate repair overstreaming
>>>>>>>> * reduce disk load of reads on cluster to close to 1/CL
>>>>>>>> * fix longstanding mutation atomicity issues caused by read repair and 
>>>>>>>> short read protection
>>>>>>>> 
>>>>>>>> Additionally, although it's outside the scope of this CEP, mutation 
>>>>>>>> tracking would enable:
>>>>>>>> * completion of witness replicas / transient replication, making the 
>>>>>>>> feature usable for all workloads
>>>>>>>> * lightweight witness only datacenters
>>>>>>>> 
>>>>>>>> The CEP is linked here: 
>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking,
>>>>>>>>  but please keep the discussion on the dev list.
>>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> Blake Eggleston
>>>>>>
