I don't see mutation tracking [1] mentioned in this thread or in the CEP-48 description. I'm not sure this fits into the scope of this initial CEP, but I have a feeling that mutation tracking could potentially be helpful in reconciling base tables and views?
For example, when both base and view updates are acknowledged, this could somehow be persisted in the view sstables' mutation tracking summary [2] or similar metadata? Then these updates would be skipped during view repair, considerably reducing the amount of work needed, since only unacknowledged view updates would need to be reconciled.

[1] - https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
[2] - https://issues.apache.org/jira/browse/CASSANDRA-20336

On Wed, May 14, 2025 at 12:59 PM Paulo Motta <pauloricard...@gmail.com> wrote:
>
> - The first thing I notice is that we're talking about repairing the entire table across the entire cluster all in one go. It's been a *long* time since I tried to do a full repair of an entire table without using sub-ranges. Is anyone here even doing that with clusters of non-trivial size? How long does a full repair of a 100 node cluster with 5TB / node take even in the best case scenario?
>
> I haven't checked the CEP yet so I may be missing something, but I think this effort doesn't need to be conflated with dense node support, to make this more approachable. I think prospective users would be OK with overprovisioning to make this feasible if needed. We could perhaps have size guardrails that limit the maximum table size per node when MVs are enabled. Ideally we should make it work for dense nodes if possible, but this shouldn't be a reason not to support the feature if it can be made to work reasonably with more resources.
>
> I think the main issue with the current MV is about correctness, and the ultimate goal of the CEP must be to provide correctness guarantees, even if it has an inevitable performance hit. I think that the performance of the repair process is definitely an important consideration, and it would be helpful to have some benchmarks to get an idea of how long this repair process would take for lightweight and denser tables.
>
> On Wed, May 14, 2025 at 7:28 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>
>> I've got several concerns around this repair process.
>>
>> - The first thing I notice is that we're talking about repairing the entire table across the entire cluster all in one go. It's been a *long* time since I tried to do a full repair of an entire table without using sub-ranges. Is anyone here even doing that with clusters of non-trivial size? How long does a full repair of a 100 node cluster with 5TB / node take even in the best case scenario?
>>
>> - Even in a scenario where sub-range repair is supported, you'd have to scan *every* sstable on the base table in order to construct a merkle tree, as we don't know in advance which SSTables contain the ranges that the MV will need. That means a subrange repair would have to do a *ton* of IO. Anyone who's mis-configured a sub-range incremental repair to use too many ranges will probably be familiar with how long it can take to anti-compact a bunch of SSTables. With MV sub-range repair, we'd have even more overhead, because we'd have to read in every SSTable, every time. If we do 10 subranges, we'll do 10x the IO of a normal repair. I don't think this is practical.
>>
>> - Merkle trees make sense when you're comparing tables with the same partition key, but I don't think they do when you're transforming a base table to a view. When there's a mismatch, what's transferred?
>> We have a range of data in the MV, but now we have to go find that from the base table. That means the merkle tree needs to not just track the hashes and ranges, but the original keys it was transformed from, in order to go find all of the matching partitions in that mismatched range. Either that, or we end up rescanning the entire dataset in order to find the mismatches.
>>
>> Jon
>>
>> On Tue, May 13, 2025 at 10:29 AM Runtian Liu <curly...@gmail.com> wrote:
>>
>>> > Looking at the details of the CEP it seems to describe Paxos as PaxosV1, but PaxosV2 works slightly differently (it can read during the prepare phase). I assume that supporting Paxos means supporting both V1 and V2 for materialized views?
>>>
>>> We are going to support Paxos V2. The CEP is not clear on that, so we will update it to clarify.
>>>
>>> It looks like the online portion is now fairly well understood. For the offline repair part, I see two main concerns: one around the scalability of the proposed approach, and another regarding how it handles tombstones.
>>>
>>> Scalability:
>>> I have added a section <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-48%3A+First-Class+Materialized+View+Support#CEP48:FirstClassMaterializedViewSupport-MVRepairVSFullRepairwithanExample> in the CEP with an example comparing full repair and the proposed MV repair; the overall scalability should not be a problem.
>>>
>>> Consider a dataset with tokens from 1 to 4 and a cluster of 4 nodes, where each node owns one token. The base table uses (pk, ck) as its primary key, while the materialized view (MV) uses (ck, pk) as its primary key. Both tables include a value column v, which allows us to correlate rows between them. The dataset consists of 16 records, distributed as follows:
>>>
>>> *Base table*
>>> (pk, ck, v)
>>> (1, 1, 1), (1, 2, 2), (1, 3, 3), (1, 4, 4) // N1
>>> (2, 1, 5), (2, 2, 6), (2, 3, 7), (2, 4, 8) // N2
>>> (3, 1, 9), (3, 2, 10), (3, 3, 11), (3, 4, 12) // N3
>>> (4, 1, 13), (4, 2, 14), (4, 3, 15), (4, 4, 16) // N4
>>>
>>> *Materialized view*
>>> (ck, pk, v)
>>> (1, 1, 1), (1, 2, 5), (1, 3, 9), (1, 4, 13) // N1
>>> (2, 1, 2), (2, 2, 6), (2, 3, 10), (2, 4, 14) // N2
>>> (3, 1, 3), (3, 2, 7), (3, 3, 11), (3, 4, 15) // N3
>>> (4, 1, 4), (4, 2, 8), (4, 3, 12), (4, 4, 16) // N4
>>>
>>> The chart below compares one round of full repair with one round of MV repair. As shown, both scan the same total number of rows. However, MV repair has higher time complexity because its Merkle tree processes each row more intensively. To avoid all nodes scanning the entire table simultaneously, MV repair should use a snapshot-based approach, similar to normal repair with the --sequential option. The time complexity increase compared to full repair can be found in the "Complexity and Memory Management" section.
>>>
>>> n: number of rows
>>> d: depth of one Merkle tree for MV repair
>>> d': depth of one Merkle tree for full repair
>>> r: number of split ranges
>>>
>>> Assuming each leaf node covers the same number of rows, 2^d' = (2^d) * r.
>>>
>>> We can see that the space complexity is the same, while MV repair has higher time complexity. However, this should not pose a significant issue in production, as the Merkle tree depth and the number of split ranges are typically not large.
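>>>
>>> Spelling out the space-complexity equality shown in the table below: substituting the assumption 2^d' = (2^d) * r,
>>>
>>> O((2^d) * r^2) = O(((2^d) * r) * r) = O((2^d') * r)
>>>
>>> so the full repair and MV repair space figures are the same quantity, just expressed in terms of d' versus d.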
>>>
>>> 1 Round Merkle Tree Building Complexity
>>>
>>>                      Full Repair      MV Repair
>>> Time complexity      O(n)             O(n*d*log(r))
>>> Space complexity     O((2^d')*r)      O((2^d)*r^2) = O((2^d')*r)
>>>
>>> Tombstone:
>>>
>>> The current proposal focuses on rebuilding the MV for a granular token range where a mismatch is detected, rather than rebuilding the entire MV token range. Since the MV is treated as a regular table, standard full or incremental repair processes should still apply to both the base and MV tables to keep their replicas in sync.
>>>
>>> Regarding tombstones, if we introduce special tombstone types or handling mechanisms for the MV table, we may be able to support tombstone synchronization between the base table and the MV. I plan to spend more time exploring whether we can introduce changes to the base table that enable this synchronization.
>>>
>>> On Mon, May 12, 2025 at 11:35 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>
>>>> > Like something doesn't add up here, because if it always includes the base table's primary key columns, that means
>>>>
>>>> The requirement for materialized views (MVs) to include the base table's primary key appears to be primarily a syntactic constraint specific to Apache Cassandra. For instance, in DynamoDB, the DDL for defining a Global Secondary Index does not mandate inclusion of the base table's primary key. This suggests that the syntax requirement in Cassandra could potentially be relaxed in the future (outside the scope of this CEP). As Benedict noted, the base table's primary key is optional when querying a materialized view.
>>>>
>>>> Jaydeep
>>>>
>>>> On Mon, May 12, 2025 at 10:45 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>
>>>>> > Or compaction hasn't made a mistake, or cell merge reconciliation hasn't made a mistake, or volume bitrot hasn't caused you to lose a file.
>>>>> > Repair isn't just about "have all transaction commits landed". It's "is the data correct N days after it's written".
>>>>>
>>>>> Don't forget about restoring from a backup.
>>>>>
>>>>> Is there a way we could do some sort of hybrid compaction + incremental repair? Maybe have the MV verify its view while it's compacting, and when it's done, mark the view's SSTable as repaired? Then the repair process would only need to do an MV to MV repair.
>>>>>
>>>>> Jon
>>>>>
>>>>> On Mon, May 12, 2025 at 9:37 AM Benedict Elliott Smith <bened...@apache.org> wrote:
>>>>>
>>>>>> Like something doesn't add up here, because if it always includes the base table's primary key columns, that means they could be storage attached by just forbidding additional columns, and there doesn't seem to be much utility in including additional columns in the primary key?
>>>>>>
>>>>>> You can re-order the keys, and they only need to be a part of the primary key, not the partition key. I think you can specify an arbitrary order to the keys also, so you can change the effective sort order. So, the basic idea is you stipulate something like PRIMARY KEY ((v1),(ck1,pk1)).
>>>>>>
>>>>>> This is basically a global index, with the restriction on single columns as keys only because we cannot cheaply read-before-write for eventually consistent operations.
>>>>>> This restriction can easily be relaxed for Paxos and Accord based implementations, which can also safely include additional keys.
>>>>>>
>>>>>> That said, I am not at all sure why they are called materialised views if we don't support including any other data besides the lookup column and the primary key. We should really rename them once they work, both to make some sense and to break with the historical baggage.
>>>>>>
>>>>>> I think this can be represented as a tombstone which can always be fetched from the base table on read or maybe some other arrangement? I agree it can't feasibly be represented as an enumeration of the deletions, at least not synchronously, and doing it async has its own problems.
>>>>>>
>>>>>> If the base table must be read on read of an index/view, then I think this proposal is approximately linearizable for the view as well (though, I do not at all warrant this statement). You still need to propagate this eventually so that the views can clean up. This also makes reads 2RT, which is rather costly.
>>>>>>
>>>>>> On 12 May 2025, at 16:10, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I think it's worth taking a step back and looking at the current MV restrictions, which are pretty onerous.
>>>>>>
>>>>>> A view must have a primary key and that primary key must conform to the following restrictions:
>>>>>>
>>>>>> - it must contain all the primary key columns of the base table. This ensures that every row of the view corresponds to exactly one row of the base table.
>>>>>> - it can only contain a single column that is not a primary key column in the base table.
>>>>>>
>>>>>> At that point, what exactly is the value in including anything except the original primary key in the MV's primary key columns, unless you are using an ordered partitioner so you can iterate based on the leading primary key columns?
>>>>>>
>>>>>> Like something doesn't add up here, because if it always includes the base table's primary key columns, that means they could be storage attached by just forbidding additional columns, and there doesn't seem to be much utility in including additional columns in the primary key?
>>>>>>
>>>>>> I'm not that clear on how much better it is to look something up in the MV vs just looking at the base table or some non-materialized view of it. How exactly are these MVs supposed to be used and what value do they provide?
>>>>>>
>>>>>> Jeff Jirsa wrote:
>>>>>>
>>>>>> There's 2 things in this proposal that give me a lot of pause.
>>>>>>
>>>>>> Runtian Liu pointed out that the CEP is sort of divided into two parts. The first is the online part, which is making reads/writes to MVs safer and more reliable using a transaction system. The second is offline, which is repair.
>>>>>>
>>>>>> The story for the online portion I think is quite strong and worth considering on its own merits.
>>>>>>
>>>>>> The offline portion (repair) sounds a little less feasible to run in production, but I also think that MVs without any mechanism for checking their consistency are not viable to run in production. So it's kind of pay for what you use in terms of the feature?
>>>>>>
>>>>>> It's definitely worth thinking through if there is a way to fix one side of this equation so it works better.
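>>>>>>
>>>>>> To make the restrictions above and the key re-ordering Benedict describes concrete, the kind of definition under discussion would look roughly like the following CQL sketch (table, view, and column types here are just illustrative, not taken from the CEP):
>>>>>>
>>>>>> CREATE TABLE base (
>>>>>>     pk1 int,
>>>>>>     ck1 int,
>>>>>>     v1 int,
>>>>>>     PRIMARY KEY (pk1, ck1)
>>>>>> );
>>>>>>
>>>>>> -- Under the current rules the view's primary key must include every base
>>>>>> -- primary key column (pk1, ck1) and may add at most one non-key column (v1).
>>>>>> -- Re-ordering is allowed, so v1 can be promoted to the partition key,
>>>>>> -- giving a global-index-style lookup by v1.
>>>>>> CREATE MATERIALIZED VIEW base_by_v1 AS
>>>>>>     SELECT * FROM base
>>>>>>     WHERE v1 IS NOT NULL AND pk1 IS NOT NULL AND ck1 IS NOT NULL
>>>>>>     PRIMARY KEY ((v1), ck1, pk1);
>>>>>>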
>>>>>> David Capwell wrote:
>>>>>>
>>>>>> As far as I can tell, being based off Accord means you don't need to care about repair, as Accord will manage the consistency for you; you can't get out of sync.
>>>>>>
>>>>>> I think a baseline requirement in C* for something to be in production is to be able to run preview repair and validate that the transaction system or any other part of Cassandra hasn't made a mistake. Divergence can have many sources, including Accord.
>>>>>>
>>>>>> Runtian Liu wrote:
>>>>>>
>>>>>> LWT cannot support the example David mentioned. Since LWTs operate on a single token, we'll need to restrict base-table updates to one partition—and ideally one row—at a time. A current MV base-table command can delete an entire partition, but doing so might touch hundreds of MV partitions, making consistency guarantees impossible.
>>>>>>
>>>>>> I think this can be represented as a tombstone which can always be fetched from the base table on read or maybe some other arrangement? I agree it can't feasibly be represented as an enumeration of the deletions, at least not synchronously, and doing it async has its own problems.
>>>>>>
>>>>>> Ariel
>>>>>>
>>>>>> On Fri, May 9, 2025, at 4:03 PM, Jeff Jirsa wrote:
>>>>>>
>>>>>> On May 9, 2025, at 12:59 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>>
>>>>>> I am a *big* fan of getting repair really working with MVs. It does seem problematic that the number of merkle trees will be equal to the number of ranges in the cluster and repair of MVs would become an all node operation. How would down nodes be handled, and how many nodes would simultaneously be working to validate a given base table range at once? How many base table ranges could simultaneously be repairing MVs?
>>>>>>
>>>>>> If a row containing a column that creates an MV partition is deleted, and the MV isn't updated, then how does the merkle tree approach propagate the deletion to the MV? The CEP says that anti-compaction would remove extra rows, but I am not clear on how that works. When is anti-compaction performed in the repair process and what is/isn't included in the outputs?
>>>>>>
>>>>>> I thought about these two points last night after I sent my email.
>>>>>>
>>>>>> There's 2 things in this proposal that give me a lot of pause.
>>>>>>
>>>>>> One is the lack of tombstones / deletions in the merkle trees, which makes properly dealing with writes/deletes/inconsistency very hard (afaict).
>>>>>>
>>>>>> The second is the reality that repairing a single partition in the base table may repair all hosts/ranges in the MV table, and vice versa. Basically, scanning either base or MV is effectively scanning the whole cluster (modulo what you can avoid in the clean/dirty repaired sets). This makes me really, really concerned with how it scales, and how likely it is to be able to schedule automatically without blowing up.
>>>>>>
>>>>>> The paxos vs accord comments so far are interesting in that I think both could be made to work, but I am very concerned about how the merkle tree comparisons are likely to work with wide partitions leading to massive fanout in ranges.