Re: [DISCUSS] CEP-45: Mutation Tracking

Chris Lohfink Thu, 09 Jan 2025 07:08:19 -0800

Is this something we can disable? I can see scenarios where this would be
strictly and severely worse than existing scenarios where we don't need
repairs, i.e. short time window data, millions of writes a second that get
thrown out after a few hours. If that data is small partitions we are
nearly doubling the disk use for things we don't care about.


Chris

On Wed, Jan 8, 2025 at 9:01 PM guo Maxwell <cclive1...@gmail.com> wrote:

> After a brief read-through, I have 2 questions. If I ask
> something inappropriate, please feel free to correct me:
>
> 1. Can a table that did not previously use mutation tracking be switched
> over to it through ALTER TABLE?
> 2.
>
>> Available options for tables are keyspace, legacy, and logged, with the
>> default being keyspace, which inherits the keyspace setting
>
>
> Do you think that keyspace_inherit (or another keyword that clearly
> explains the behavior) is a better name than keyspace?
> In addition, is legacy appropriate? Because this is a new feature, there
> is only the behavior of turning it on and off. Turning it off means not
> using this feature.
> If the keyword legacy is used, from the user's perspective, is it using an
> old version of the mutation tracking? Similar to the relationship between
> SAI and native2i.
>
> On Thu, Jan 9, 2025 at 06:14 Jon Haddad <j...@rustyrazorblade.com> wrote:
>
>> JD, the fact that pagination is implemented as multiple queries is a
>> design choice.  A user performs a query with fetch size 1 or 100 and they
>> will get different behavior.
>>
>> I'm not asking for anyone to implement MVCC.  I'm asking for the docs
>> around this to be correct.  We should not use the term guarantee here, it's
>> best effort.
>>
>>
>>
>>
>> On Wed, Jan 8, 2025 at 2:06 PM J. D. Jordan <jeremiah.jor...@gmail.com>
>> wrote:
>>
>>> Your pagination case is not a violation of any guarantees Cassandra
>>> makes. It has never made guarantees across multiple queries.
>>> Trying to have MVCC/consistent data across multiple queries is a very
>>> different issue/problem from this CEP.  If you want to have a discussion
>>> about MVCC I suggest creating a new thread.
>>>
>>> -Jeremiah
>>>
>>> On Jan 8, 2025, at 3:47 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>
>>> 
>>> > It's true that we can't offer multi-page write atomicity without some
>>> sort of MVCC. There are a lot of common query patterns that don't involve
>>> paging though, so it's not like the benefit of fixing write atomicity would
>>> only apply to a small subset of carefully crafted queries or something.
>>>
>>> Sure, it'll work a lot, but we don't say "partition level write
>>> atomicity some of the time".  We say guarantee.  From the CEP:
>>>
>>> > In the case of read repair, since we are only reading and correcting
>>> the parts of a partition that we're reading and not the entire contents of
>>> a partition on each read, read repair can break our *guarantee* on
>>> partition level write atomicity. This approach also prevents meeting the
>>> monotonic read requirement for witness replicas, which has significantly
>>> limited its usefulness.
>>>
>>> I point this out because it's not well known, and we make a guarantee
>>> that isn't true, and while the CEP will reduce the number of cases in which
>>> we violate the guarantee, we will still have known edge cases that it
>>> doesn't hold up.  So we should stop saying it.
>>>
>>>
>>>
>>>
>>> On Wed, Jan 8, 2025 at 1:30 PM Blake Eggleston <beggles...@apple.com>
>>> wrote:
>>>
>>>> Thanks Dimitry and Jon, answers below
>>>>
>>>> 1) Is a single separate commit log expected to be created for all
>>>> tables with the new replication type?
>>>>
>>>>
>>>> The plan is to still have a single commit log, but only index mutations
>>>> with a mutation id.
>>>>
>>>> 2) What is a granularity of storing mutation ids in memtable, is it per
>>>> cell?
>>>>
>>>>
>>>> It would be per-partition
>>>>
>>>> 3) If we update the same row multiple times while it is in a memtable -
>>>> are all mutation ids appended to a kind of collection?
>>>>
>>>>
>>>> They would, yes. We might be able to do something where we stop tracking
>>>> mutations that have been superseded by newer mutations (same cells, higher
>>>> timestamps), but I suspect that would be more trouble than it's worth and
>>>> would be out of scope for v1.
>>>>
>>>> 4) What is the expected size of a single id?
>>>>
>>>>
>>>> It's currently 12 bytes: a 4-byte node id (from tcm) and an 8-byte hlc
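
A minimal sketch of what a 12-byte id of that shape could look like, assuming a big-endian layout with the node id first and the HLC second. The thread only gives the two field sizes, so the field order, the byte order, and the names `MUTATION_ID`, `pack_mutation_id` and `unpack_mutation_id` are all hypothetical and not taken from the CEP:

```python
import struct

# Assumed layout (not from the CEP): big-endian, 4-byte unsigned node id
# followed by an 8-byte unsigned HLC value. ">" disables padding, so the
# packed size is exactly 4 + 8 = 12 bytes.
MUTATION_ID = struct.Struct(">IQ")


def pack_mutation_id(node_id: int, hlc: int) -> bytes:
    """Encode a (node id, hlc) pair into 12 bytes."""
    return MUTATION_ID.pack(node_id, hlc)


def unpack_mutation_id(raw: bytes) -> tuple[int, int]:
    """Decode 12 bytes back into a (node id, hlc) pair."""
    return MUTATION_ID.unpack(raw)


# Example values are arbitrary.
raw = pack_mutation_id(7, 1_736_370_000_000_000)
node_id, hlc = unpack_mutation_id(raw)
```

With per-partition tracking in the memtable, every tracked mutation would add one such 12-byte entry to its partition, which is the cost to weigh for very small partitions.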
>>>>
>>>> 5) Do we plan to support multi-table batches (single or
>>>> multi-partition) for this replication type?
>>>>
>>>>
>>>> This is intended to support all existing features, however the tracking
>>>> only happens at the mutation level, so the different mutations coming out
>>>> of a multi-partition batch would all be tracked individually
>>>>
>>>> So even without repair mucking things up, we're unable to fulfill this
>>>> promise except under the specific, ideal circumstance of querying a
>>>> partition with only 1 page.
>>>>
>>>>
>>>> It's true that we can't offer multi-page write atomicity without some
>>>> sort of MVCC. There are a lot of common query patterns that don't involve
>>>> paging though, so it's not like the benefit of fixing write atomicity would
>>>> only apply to a small subset of carefully crafted queries or something.
>>>>
>>>> Thanks,
>>>>
>>>> Blake
>>>>
>>>> On Jan 8, 2025, at 12:23 PM, Jon Haddad <j...@rustyrazorblade.com>
>>>> wrote:
>>>>
>>>> Very cool!  I'll need to spend some time reading this over.  One thing
>>>> I did notice is this:
>>>>
>>>> > Cassandra promises partition level write atomicity. This means that,
>>>> although writes are eventually consistent, a given write will either be
>>>> visible or not visible. You're not supposed to see a partially applied
>>>> write. However, read repair and short read protection can both "tear"
>>>> mutations. In the case of read repair, this is because the data resolver
>>>> only evaluates the data included in the client read. So if your read only
>>>> covers a portion of a write that didn't reach a quorum, only that portion
>>>> will be repaired, breaking write atomicity.
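
The read repair case quoted above can be illustrated with a toy model. This is only a sketch: two dicts stand in for two replicas of one partition, and `read_with_repair` is a hypothetical helper, not Cassandra's data resolver:

```python
# Toy model: each replica maps column -> (timestamp, value) for one partition.
replica_1 = {"a": (1, "old"), "b": (1, "old")}
replica_2 = {"a": (1, "old"), "b": (1, "old")}

# A single write updating both columns reaches replica_1 only.
replica_1.update({"a": (2, "new"), "b": (2, "new")})


def read_with_repair(columns, replicas):
    """Resolve only the requested columns and write the winners back."""
    resolved = {}
    for col in columns:
        # Tuples compare by timestamp first, so the newest version wins.
        winner = max(replica[col] for replica in replicas)
        resolved[col] = winner
        for replica in replicas:
            replica[col] = winner
    return resolved


# The client read covers column "a" only, so only "a" is repaired.
read_with_repair(["a"], [replica_1, replica_2])

# replica_2 now holds half of the original write: "a" is new, "b" is old.
torn_on_replica_2 = replica_2["a"][1] == "new" and replica_2["b"][1] == "old"
```

A later read served by replica_2 alone would see the write partially applied, which is the "tearing" the CEP text describes.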
>>>>
>>>> Unfortunately there are more issues with this than just repair.  Since we
>>>> lack a consistency mechanism like MVCC while paginating, it's possible to
>>>> do the following:
>>>>
>>>> thread A: reads a partition P with 10K rows, starts by reading the
>>>> first page
>>>> thread B: another thread writes a batch to 2 rows in partition P, one
>>>> on page 1, another on page 2
>>>> thread A: reads the second page of P which has the mutation.
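
The interleaving above can be sketched as a toy model, with the three steps run in order rather than on real threads. A dict stands in for partition P, and the page size, row numbers and helper names are assumptions chosen for illustration, not Cassandra code:

```python
PAGE_SIZE = 5_000  # assumed fetch size, so 10K rows span two pages

# Partition P with 10K rows, all holding the old value.
partition = {row: "old" for row in range(10_000)}


def read_page(page):
    """Read one page; each page is an independent query with no snapshot."""
    start = page * PAGE_SIZE
    return {row: partition[row] for row in range(start, start + PAGE_SIZE)}


def write_batch(rows, value):
    """Apply one batch touching several rows of the same partition."""
    for row in rows:
        partition[row] = value


page1 = read_page(0)               # thread A: reads the first page
write_batch([10, 7_000], "new")    # thread B: one row on page 1, one on page 2
page2 = read_page(1)               # thread A: reads the second page

# Thread A's combined view contains only half of thread B's batch.
result = {**page1, **page2}
torn = result[10] == "old" and result[7_000] == "new"
```

Rerunning the same steps with `PAGE_SIZE = 10_000` puts both rows on a single page, so the read sees either none or all of the batch.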
>>>>
>>>> I've worked with users who have been surprised by this behavior,
>>>> because pagination happens transparently.
>>>>
>>>> So even without repair mucking things up, we're unable to fulfill this
>>>> promise except under the specific, ideal circumstance of querying a
>>>> partition with only 1 page.
>>>>
>>>> Jon
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston <beggles...@apple.com>
>>>> wrote:
>>>>
>>>>> Hello dev@,
>>>>>
>>>>> We'd like to propose CEP-45: Mutation Tracking for adoption by the
>>>>> community. CEP-45 proposes adding a replication mechanism to track and
>>>>> reconcile individual mutations, as well as processes to actively reconcile
>>>>> missing mutations.
>>>>>
>>>>> For keyspaces with mutation tracking enabled, the immediate benefits
>>>>> of this CEP are:
>>>>> * reduced replication lag with a continuous background reconciliation
>>>>> process
>>>>> * eliminate the disk load caused by repair merkle tree calculation
>>>>> * eliminate repair overstreaming
>>>>> * reduce disk load of reads on cluster to close to 1/CL
>>>>> * fix longstanding mutation atomicity issues caused by read repair and
>>>>> short read protection
>>>>>
>>>>> Additionally, although it's outside the scope of this CEP, mutation
>>>>> tracking would enable:
>>>>> * completion of witness replicas / transient replication, making the
>>>>> feature usable for all workloads
>>>>> * lightweight witness only datacenters
>>>>>
>>>>> The CEP is linked here:
>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking,
>>>>> but please keep the discussion on the dev list.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Blake Eggleston
>>>>>
>>>>
>>>>
