Is this something we can disable? I can see scenarios where this would be strictly and severely worse than the existing behavior, namely scenarios where we don't need repairs, e.g. short-time-window data: millions of writes a second that get thrown out after a few hours. If that data is in small partitions, we are nearly doubling the disk use for things we don't care about.
Chris

On Wed, Jan 8, 2025 at 9:01 PM guo Maxwell <cclive1...@gmail.com> wrote:

> After a brief read-through, I have 2 questions. If I ask something
> inappropriate, please feel free to correct me:
>
> 1. Does it support changing a table to use mutation tracking through
> ALTER TABLE if the table did not use mutation tracking before?
> 2.
>
>> Available options for tables are keyspace, legacy, and logged, with the
>> default being keyspace, which inherits the keyspace setting
>
> Do you think that keyspace_inherit (or another keyword that clearly
> explains the behavior) is a better name than keyspace?
> In addition, is legacy appropriate? Because this is a new feature, there
> is only the behavior of turning it on and off. Turning it off means not
> using this feature.
> If the keyword legacy is used, from the user's perspective, is it using an
> old version of the mutation tracking? Similar to the relationship between
> SAI and native2i.
>
> Jon Haddad <j...@rustyrazorblade.com> wrote on Thu, Jan 9, 2025 at 06:14:
>
>> JD, the fact that pagination is implemented as multiple queries is a
>> design choice. A user performs a query with fetch size 1 or 100 and they
>> will get different behavior.
>>
>> I'm not asking for anyone to implement MVCC. I'm asking for the docs
>> around this to be correct. We should not use the term guarantee here, it's
>> best effort.
>>
>> On Wed, Jan 8, 2025 at 2:06 PM J. D. Jordan <jeremiah.jor...@gmail.com>
>> wrote:
>>
>>> Your pagination case is not a violation of any guarantees Cassandra
>>> makes. It has never made guarantees across multiple queries.
>>> Trying to have MVCC/consistent data across multiple queries is a very
>>> different issue/problem from this CEP. If you want to have a discussion
>>> about MVCC I suggest creating a new thread.
>>>
>>> -Jeremiah
>>>
>>> On Jan 8, 2025, at 3:47 PM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>
>>> > It's true that we can't offer multi-page write atomicity without some
>>> > sort of MVCC. There are a lot of common query patterns that don't involve
>>> > paging though, so it's not like the benefit of fixing write atomicity would
>>> > only apply to a small subset of carefully crafted queries or something.
>>>
>>> Sure, it'll work a lot, but we don't say "partition level write
>>> atomicity some of the time". We say guarantee. From the CEP:
>>>
>>> > In the case of read repair, since we are only reading and correcting
>>> > the parts of a partition that we're reading and not the entire contents of
>>> > a partition on each read, read repair can break our *guarantee* on
>>> > partition level write atomicity. This approach also prevents meeting the
>>> > monotonic read requirement for witness replicas, which has significantly
>>> > limited its usefulness.
>>>
>>> I point this out because it's not well known, and we make a guarantee
>>> that isn't true, and while the CEP will reduce the number of cases in which
>>> we violate the guarantee, we will still have known edge cases that it
>>> doesn't hold up. So we should stop saying it.
>>>
>>> On Wed, Jan 8, 2025 at 1:30 PM Blake Eggleston <beggles...@apple.com>
>>> wrote:
>>>
>>>> Thanks Dimitry and Jon, answers below
>>>>
>>>> 1) Is a single separate commit log expected to be created for all
>>>> tables with the new replication type?
>>>>
>>>> The plan is to still have a single commit log, but only index mutations
>>>> with a mutation id.
>>>>
>>>> 2) What is a granularity of storing mutation ids in memtable, is it per
>>>> cell?
>>>>
>>>> It would be per-partition
>>>>
>>>> 3) If we update the same row multiple times while it is in a memtable -
>>>> are all mutation ids appended to a kind of collection?
>>>>
>>>> They would yes.
>>>> We might be able to do something where we stop tracking
>>>> mutations that have been superseded by newer mutations (same cells, higher
>>>> timestamps), but I suspect that would be more trouble than it's worth and
>>>> would be out of scope for v1.
>>>>
>>>> 4) What is the expected size of a single id?
>>>>
>>>> It's currently 12 bytes: a 4-byte node id (from TCM) and an 8-byte HLC
>>>>
>>>> 5) Do we plan to support multi-table batches (single or
>>>> multi-partition) for this replication type?
>>>>
>>>> This is intended to support all existing features, however the tracking
>>>> only happens at the mutation level, so the different mutations coming out
>>>> of a multi-partition batch would all be tracked individually
>>>>
>>>> So even without repair mucking things up, we're unable to fulfill this
>>>> promise except under the specific, ideal circumstance of querying a
>>>> partition with only 1 page.
>>>>
>>>> It's true that we can't offer multi-page write atomicity without some
>>>> sort of MVCC. There are a lot of common query patterns that don't involve
>>>> paging though, so it's not like the benefit of fixing write atomicity would
>>>> only apply to a small subset of carefully crafted queries or something.
>>>>
>>>> Thanks,
>>>>
>>>> Blake
>>>>
>>>> On Jan 8, 2025, at 12:23 PM, Jon Haddad <j...@rustyrazorblade.com>
>>>> wrote:
>>>>
>>>> Very cool! I'll need to spend some time reading this over. One thing
>>>> I did notice is this:
>>>>
>>>> > Cassandra promises partition level write atomicity. This means that,
>>>> > although writes are eventually consistent, a given write will either be
>>>> > visible or not visible. You're not supposed to see a partially applied
>>>> > write. However, read repair and short read protection can both "tear"
>>>> > mutations. In the case of read repair, this is because the data resolver
>>>> > only evaluates the data included in the client read.
>>>> > So if your read only covers a portion of a write that didn't reach a
>>>> > quorum, only that portion will be repaired, breaking write atomicity.
>>>>
>>>> Unfortunately there's more issues with this than just repair. Since we
>>>> lack a consistency mechanism like MVCC while paginating, it's possible to
>>>> do the following:
>>>>
>>>> thread A: reads a partition P with 10K rows, starts by reading the
>>>> first page
>>>> thread B: another thread writes a batch to 2 rows in partition P, one
>>>> on page 1, another on page 2
>>>> thread A: reads the second page of P which has the mutation.
>>>>
>>>> I've worked with users who have been surprised by this behavior,
>>>> because pagination happens transparently.
>>>>
>>>> So even without repair mucking things up, we're unable to fulfill this
>>>> promise except under the specific, ideal circumstance of querying a
>>>> partition with only 1 page.
>>>>
>>>> Jon
>>>>
>>>> On Wed, Jan 8, 2025 at 11:21 AM Blake Eggleston <beggles...@apple.com>
>>>> wrote:
>>>>
>>>>> Hello dev@,
>>>>>
>>>>> We'd like to propose CEP-45: Mutation Tracking for adoption by the
>>>>> community. CEP-45 proposes adding a replication mechanism to track and
>>>>> reconcile individual mutations, as well as processes to actively reconcile
>>>>> missing mutations.
>>>>>
>>>>> For keyspaces with mutation tracking enabled, the immediate benefits
>>>>> of this CEP are:
>>>>> * reduced replication lag with a continuous background reconciliation
>>>>> process
>>>>> * eliminate the disk load caused by repair merkle tree calculation
>>>>> * eliminate repair overstreaming
>>>>> * reduce disk load of reads on cluster to close to 1/CL
>>>>> * fix longstanding mutation atomicity issues caused by read repair and
>>>>> short read protection
>>>>>
>>>>> Additionally, although it's outside the scope of this CEP, mutation
>>>>> tracking would enable:
>>>>> * completion of witness replicas / transient replication, making the
>>>>> feature usable for all workloads
>>>>> * lightweight witness only datacenters
>>>>>
>>>>> The CEP is linked here:
>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking,
>>>>> but please keep the discussion on the dev list.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Blake Eggleston