Re: [DISCUSS] CEP-48: First-Class Materialized View Support

Josh McKenzie Sun, 11 May 2025 05:19:54 -0700

> This makes me really, really concerned with how it scales, and how likely it 
> is to be able to schedule automatically without blowing up. 
It seems to me that resource-aware throttling would be the solution here, or 
from a more primitive case, just hard bounding threadpool size, throughput 
rate, etc. Worst-case you end up in a situation where your MV anti-entropy 
can't keep up and you surface that to the operator rather than bogging down 
and/or killing your entire cluster due to touching all the nodes.


This isn't a problem isolated to MV's; our lack of robust resource scheduling 
and balancing between operations is a long-standing problem that just scales 
linearly w/the number of nodes in the cluster you have to hit. Same problem 
would arise from a global index; there's no deep technical reason querying all 
the nodes in the cluster should be a third rail excepting we currently have no 
way to prevent that from compounding and bringing the cluster down. 

On Fri, May 9, 2025, at 9:11 PM, Runtian Liu wrote:
> I’ve added a new section on isolation and consistency 
> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-48%3A+First-Class+Materialized+View+Support#CEP48:FirstClassMaterializedViewSupport-IsolationandConsistency>.
>  In our current design, materialized-view tables stay eventually consistent, 
> while the base table offers linearizability. Here, “strict consistency” 
> refers to linearizable base-table updates, with every successful write 
> ensuring that the corresponding MV change is applied and visible.
> 
> >Why mandate `LOCAL_QUORUM` instead of using the consistency level requested 
> >by the application? If they want to use `LOCAL_QUORUM` they can always 
> >request it.
> 
> I think you meant LOCAL_SERIAL? Right, LOCAL_SERIAL should not be mandatory 
> and users should be able to select which consistency to use. Updated the page 
> for this one.
> 
> For the example David mentioned, LWT cannot support. Since LWTs operate on a 
> single token, we’ll need to restrict base-table updates to one partition—and 
> ideally one row—at a time. A current MV base-table command can delete an 
> entire partition, but doing so might touch hundreds of MV partitions, making 
> consistency guarantees impossible. Limiting each operation’s scope lets us 
> ensure that every successful base-table write is accurately propagated to its 
> MV. Even with Accord backed MV, I think we will need to limit the number of 
> rows that get modified each time.
> 
> Regarding repair, due to bugs, operator errors, or hardware faults, MVs can 
> become out of sync with their base tables—regardless of the chosen 
> synchronization method during writes. The purpose of MV repair is to detect 
> and resolve these mismatches using the base table as the source of truth. As 
> a result, if data resurrection occurs in the base table, the repair process 
> will propagate that resurrected data to the MV.
> 
> >One is the lack of tombstones / deletions in the merle trees, which makes 
> >properly dealing with writes/deletes/inconsistency very hard (afaict)
> 
> Tombstones are excluded because a base table update can produce a tombstone 
> in the MV—for example, when the updated cell is part of the MV's primary key. 
> Since such tombstones may not exist in the base table, we can only compare 
> live data during MV repair.
> 
> 
> 
> > repairing a single partition in the base table may repair all hosts/ranges 
> > in the MV table,
> That’s correct. To avoid repeatedly scanning both tables, the proposed 
> solution is for all nodes to take a snapshot first. Then, each node scans the 
> base table once and the MV table once, generating a list of Merkle trees from 
> each scan. These lists are then compared to identify mismatches. This means 
> MV repair must be performed at the table level rather than one token range at 
> a time to be efficient.
> 
> >If a row containing a column that creates an MV partition is deleted, and 
> >the MV isn't updated, then how does the merkle tree approach propagate the 
> >deletion to the MV? The CEP says that anti-compaction would remove extra 
> >rows, but I am not clear on how that works. When is anti-compaction 
> >performed in the repair process and what is/isn't included in the outputs?
> 
> Let me illustrate this with an example:
> 
> We have the following base table and MV:
> 
> `CREATE TABLE base (pk int, ck int, v int, PRIMARY KEY (pk, ck));
> CREATE MATERIALIZED VIEW mv AS SELECT * FROM base PRIMARY KEY (ck, pk);
`
> Assume there are 100 rows in the base table (e.g., (1,1), (2,2), ..., 
> (100,100)), and accordingly, the MV also has 100 rows. Now, suppose the row 
> (55,55) is deleted from the base table, but due to some issue, it still 
> exists in the MV.
> 
> Let's say each Merkle tree covers 20 rows in both the base and MV tables, so 
> we have a 5x5 grid—25 Merkle tree comparisons in total. Suppose the repair 
> job detects a mismatch in the range base(40–59) vs MV(40–59).
> 
> On the node that owns the MV range (40–59), anti-compaction will be 
> triggered. If all 100 rows were in a single SSTable, it would be split into 
> two SSTables: one containing the 20 rows in the (40–59) range, and the other 
> containing the remaining 80 rows.
> 
> On the base table side, the node will scan the (40–59) range, identify all 
> rows that map to the MV range (40–59)—which in this example would be 19 
> rows—and stream them to the MV node. Once streaming completes, the MV node 
> can safely mark the 20-row SSTable as obsolete. In this way, the extra row in 
> MV is removed.
> 
> The core idea is to reconstruct the MV data for base range (40–59) and MV 
> range (40–59) using the corresponding base table range as the source of truth.
> 
> 
> 
> 
> On Fri, May 9, 2025 at 2:26 PM David Capwell <dcapw...@apple.com> wrote:
>>> The MV repair tool in Cassandra is intended to address inconsistencies that 
>>> may occur in materialized views due to various factors. This component is 
>>> the most complex and demanding part of the development effort, representing 
>>> roughly 70% of the overall work.
>> 
>>> but I am very concerned about how the merkle tree comparisons are likely to 
>>> work with wide partitions leading to massive fanout in ranges. 
>> 
>> As far as I can tell, being based off Accord means you don’t need to care 
>> about repair, as Accord will manage the consistency for you; you can’t get 
>> out of sync.
>> 
>> Being based off accord also means you can deal with multiple 
>> partitions/tokens, where as LWT is limited to a single token.  I am not sure 
>> how the following would work with the proposed design and LWT
>> 
>> CREATE TABLE tbl (pk int, ck int, v int, PRIMARY KEY (pk, ck));
>> CREATE MATERIALIZED VIEW tbl2
>> AS SELECT * FROM tbl WHERE ck > 42 PRIMARY KEY(pk, ck)
>> 
>> — mutations
>> UPDATE tbl SET v=42 WHERE pk IN (0, 1) AND ck IN (50, 74); — this touches 2 
>> partition keys
>> BEGIN BATCH — also touches 2 partition keys
>>   INSERT INTO tbl (pk, ck, v) VALUES (0, 47, 0);
>>   INSERT INTO tbl (pk, ck, v) VALUES (1, 48, 0);
>> END BATCH
>> 
>> 
>> 
>>> On May 9, 2025, at 1:03 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>> 
>>> 
>>> 
>>>> On May 9, 2025, at 12:59 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>> 
>>>> 
>>>> I am *big* fan of getting repair really working with MVs. It does seem 
>>>> problematic that the number of merkle trees will be equal to the number of 
>>>> ranges in the cluster and repair of MVs would become an all node 
>>>> operation.  How would down nodes be handled and how many nodes would 
>>>> simultaneously working to validate a given base table range at once? How 
>>>> many base table ranges could simultaneously be repairing MVs?
>>>> 
>>>> If a row containing a column that creates an MV partition is deleted, and 
>>>> the MV isn't updated, then how does the merkle tree approach propagate the 
>>>> deletion to the MV? The CEP says that anti-compaction would remove extra 
>>>> rows, but I am not clear on how that works. When is anti-compaction 
>>>> performed in the repair process and what is/isn't included in the outputs?
>>> 
>>> 
>>> I thought about these two points last night after I sent my email.
>>> 
>>> There’s 2 things in this proposal that give me a lot of pause.
>>> 
>>> One is the lack of tombstones / deletions in the merle trees, which makes 
>>> properly dealing with writes/deletes/inconsistency very hard (afaict)
>>> 
>>> The second is the reality that repairing a single partition in the base 
>>> table may repair all hosts/ranges in the MV table, and vice versa. 
>>> Basically scanning either base or MV is effectively scanning the whole 
>>> cluster (modulo what you can avoid in the clean/dirty repaired sets). This 
>>> makes me really, really concerned with how it scales, and how likely it is 
>>> to be able to schedule automatically without blowing up. 
>>> 
>>> The paxos vs accord comments so far are interesting in that I think both 
>>> could be made to work, but I am very concerned about how the merkle tree 
>>> comparisons are likely to work with wide partitions leading to massive 
>>> fanout in ranges. 
>>> 
>>>

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

Reply via email to