> This makes me really, really concerned with how it scales, and how likely it > is to be able to schedule automatically without blowing up. It seems to me that resource-aware throttling would be the solution here, or from a more primitive case, just hard bounding threadpool size, throughput rate, etc. Worst-case you end up in a situation where your MV anti-entropy can't keep up and you surface that to the operator rather than bogging down and/or killing your entire cluster due to touching all the nodes.
This isn't a problem isolated to MV's; our lack of robust resource scheduling and balancing between operations is a long-standing problem that just scales linearly w/the number of nodes in the cluster you have to hit. Same problem would arise from a global index; there's no deep technical reason querying all the nodes in the cluster should be a third rail excepting we currently have no way to prevent that from compounding and bringing the cluster down. On Fri, May 9, 2025, at 9:11 PM, Runtian Liu wrote: > I’ve added a new section on isolation and consistency > <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-48%3A+First-Class+Materialized+View+Support#CEP48:FirstClassMaterializedViewSupport-IsolationandConsistency>. > In our current design, materialized-view tables stay eventually consistent, > while the base table offers linearizability. Here, “strict consistency” > refers to linearizable base-table updates, with every successful write > ensuring that the corresponding MV change is applied and visible. > > >Why mandate `LOCAL_QUORUM` instead of using the consistency level requested > >by the application? If they want to use `LOCAL_QUORUM` they can always > >request it. > > I think you meant LOCAL_SERIAL? Right, LOCAL_SERIAL should not be mandatory > and users should be able to select which consistency to use. Updated the page > for this one. > > For the example David mentioned, LWT cannot support. Since LWTs operate on a > single token, we’ll need to restrict base-table updates to one partition—and > ideally one row—at a time. A current MV base-table command can delete an > entire partition, but doing so might touch hundreds of MV partitions, making > consistency guarantees impossible. Limiting each operation’s scope lets us > ensure that every successful base-table write is accurately propagated to its > MV. Even with Accord backed MV, I think we will need to limit the number of > rows that get modified each time. > > Regarding repair, due to bugs, operator errors, or hardware faults, MVs can > become out of sync with their base tables—regardless of the chosen > synchronization method during writes. The purpose of MV repair is to detect > and resolve these mismatches using the base table as the source of truth. As > a result, if data resurrection occurs in the base table, the repair process > will propagate that resurrected data to the MV. > > >One is the lack of tombstones / deletions in the merle trees, which makes > >properly dealing with writes/deletes/inconsistency very hard (afaict) > > Tombstones are excluded because a base table update can produce a tombstone > in the MV—for example, when the updated cell is part of the MV's primary key. > Since such tombstones may not exist in the base table, we can only compare > live data during MV repair. > > > > > repairing a single partition in the base table may repair all hosts/ranges > > in the MV table, > That’s correct. To avoid repeatedly scanning both tables, the proposed > solution is for all nodes to take a snapshot first. Then, each node scans the > base table once and the MV table once, generating a list of Merkle trees from > each scan. These lists are then compared to identify mismatches. This means > MV repair must be performed at the table level rather than one token range at > a time to be efficient. > > >If a row containing a column that creates an MV partition is deleted, and > >the MV isn't updated, then how does the merkle tree approach propagate the > >deletion to the MV? The CEP says that anti-compaction would remove extra > >rows, but I am not clear on how that works. When is anti-compaction > >performed in the repair process and what is/isn't included in the outputs? > > Let me illustrate this with an example: > > We have the following base table and MV: > > `CREATE TABLE base (pk int, ck int, v int, PRIMARY KEY (pk, ck)); > CREATE MATERIALIZED VIEW mv AS SELECT * FROM base PRIMARY KEY (ck, pk); ` > Assume there are 100 rows in the base table (e.g., (1,1), (2,2), ..., > (100,100)), and accordingly, the MV also has 100 rows. Now, suppose the row > (55,55) is deleted from the base table, but due to some issue, it still > exists in the MV. > > Let's say each Merkle tree covers 20 rows in both the base and MV tables, so > we have a 5x5 grid—25 Merkle tree comparisons in total. Suppose the repair > job detects a mismatch in the range base(40–59) vs MV(40–59). > > On the node that owns the MV range (40–59), anti-compaction will be > triggered. If all 100 rows were in a single SSTable, it would be split into > two SSTables: one containing the 20 rows in the (40–59) range, and the other > containing the remaining 80 rows. > > On the base table side, the node will scan the (40–59) range, identify all > rows that map to the MV range (40–59)—which in this example would be 19 > rows—and stream them to the MV node. Once streaming completes, the MV node > can safely mark the 20-row SSTable as obsolete. In this way, the extra row in > MV is removed. > > The core idea is to reconstruct the MV data for base range (40–59) and MV > range (40–59) using the corresponding base table range as the source of truth. > > > > > On Fri, May 9, 2025 at 2:26 PM David Capwell <dcapw...@apple.com> wrote: >>> The MV repair tool in Cassandra is intended to address inconsistencies that >>> may occur in materialized views due to various factors. This component is >>> the most complex and demanding part of the development effort, representing >>> roughly 70% of the overall work. >> >>> but I am very concerned about how the merkle tree comparisons are likely to >>> work with wide partitions leading to massive fanout in ranges. >> >> As far as I can tell, being based off Accord means you don’t need to care >> about repair, as Accord will manage the consistency for you; you can’t get >> out of sync. >> >> Being based off accord also means you can deal with multiple >> partitions/tokens, where as LWT is limited to a single token. I am not sure >> how the following would work with the proposed design and LWT >> >> CREATE TABLE tbl (pk int, ck int, v int, PRIMARY KEY (pk, ck)); >> CREATE MATERIALIZED VIEW tbl2 >> AS SELECT * FROM tbl WHERE ck > 42 PRIMARY KEY(pk, ck) >> >> — mutations >> UPDATE tbl SET v=42 WHERE pk IN (0, 1) AND ck IN (50, 74); — this touches 2 >> partition keys >> BEGIN BATCH — also touches 2 partition keys >> INSERT INTO tbl (pk, ck, v) VALUES (0, 47, 0); >> INSERT INTO tbl (pk, ck, v) VALUES (1, 48, 0); >> END BATCH >> >> >> >>> On May 9, 2025, at 1:03 PM, Jeff Jirsa <jji...@gmail.com> wrote: >>> >>> >>> >>>> On May 9, 2025, at 12:59 PM, Ariel Weisberg <ar...@weisberg.ws> wrote: >>>> >>>> >>>> I am *big* fan of getting repair really working with MVs. It does seem >>>> problematic that the number of merkle trees will be equal to the number of >>>> ranges in the cluster and repair of MVs would become an all node >>>> operation. How would down nodes be handled and how many nodes would >>>> simultaneously working to validate a given base table range at once? How >>>> many base table ranges could simultaneously be repairing MVs? >>>> >>>> If a row containing a column that creates an MV partition is deleted, and >>>> the MV isn't updated, then how does the merkle tree approach propagate the >>>> deletion to the MV? The CEP says that anti-compaction would remove extra >>>> rows, but I am not clear on how that works. When is anti-compaction >>>> performed in the repair process and what is/isn't included in the outputs? >>> >>> >>> I thought about these two points last night after I sent my email. >>> >>> There’s 2 things in this proposal that give me a lot of pause. >>> >>> One is the lack of tombstones / deletions in the merle trees, which makes >>> properly dealing with writes/deletes/inconsistency very hard (afaict) >>> >>> The second is the reality that repairing a single partition in the base >>> table may repair all hosts/ranges in the MV table, and vice versa. >>> Basically scanning either base or MV is effectively scanning the whole >>> cluster (modulo what you can avoid in the clean/dirty repaired sets). This >>> makes me really, really concerned with how it scales, and how likely it is >>> to be able to schedule automatically without blowing up. >>> >>> The paxos vs accord comments so far are interesting in that I think both >>> could be made to work, but I am very concerned about how the merkle tree >>> comparisons are likely to work with wide partitions leading to massive >>> fanout in ranges. >>> >>>