Re: [DISCUSS] CEP-48: First-Class Materialized View Support

Benedict Elliott Smith Mon, 12 May 2025 09:35:38 -0700

> Like something doesn't add up here because if it always includes the base 
> table's primary key columns that means they could be storage attached by just 
> forbidding additional columns and there doesn't seem to be much utility in 
> including additional columns in the primary key?


You can re-order the keys, and they only need to be part of the primary key, 
not the partition key. I think you can also specify an arbitrary order for the 
keys, so you can change the effective sort order. So the basic idea is that you 
stipulate something like PRIMARY KEY ((v1), ck1, pk1).
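
To make the re-ordering concrete, here is a minimal Python sketch of a view 
partitioned by v1 and clustered by ck1 then pk1. It is an in-memory model 
only, not Cassandra code: view_key_is_valid and build_view are hypothetical 
helpers, and the sample rows are invented for illustration.

```python
from collections import defaultdict


def view_key_is_valid(base_pk, view_partition_key, view_clustering):
    """Check the two view key restrictions quoted in this thread (hypothetical helper)."""
    view_key = list(view_partition_key) + list(view_clustering)
    # Every base primary key column must appear somewhere in the view key,
    # though not necessarily in the partition key.
    if not set(base_pk) <= set(view_key):
        return False
    # At most one view key column may come from outside the base primary key.
    extra = [c for c in view_key if c not in base_pk]
    return len(extra) <= 1


def build_view(base_rows, view_partition_key, view_clustering):
    """Group rows by the view partition key, then sort each group by the
    view clustering columns in the order given."""
    partitions = defaultdict(list)
    for row in base_rows:
        pkey = tuple(row[c] for c in view_partition_key)
        partitions[pkey].append(row)
    for rows in partitions.values():
        rows.sort(key=lambda r: tuple(r[c] for c in view_clustering))
    return dict(partitions)


# Base table keyed by (pk1, ck1); the sample data is made up.
base_pk = ["pk1", "ck1"]
base_rows = [
    {"pk1": 1, "ck1": 20, "v1": "a"},
    {"pk1": 2, "ck1": 10, "v1": "a"},
    {"pk1": 1, "ck1": 10, "v1": "b"},
]

assert view_key_is_valid(base_pk, ["v1"], ["ck1", "pk1"])
view = build_view(base_rows, ["v1"], ["ck1", "pk1"])
```

In this model the rows under v1 = "a" come back ordered by ck1 first, so the 
row with pk1 = 2 precedes the row with pk1 = 1, which is the reverse of their 
order by base primary key.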

This is basically a global index, restricted to a single additional key column 
only because we cannot cheaply read-before-write for eventually consistent 
operations. This restriction can easily be relaxed for Paxos- and Accord-based 
implementations, which can also safely include additional keys.

That said, I am not at all sure why they are called materialised views if we 
don’t support including any other data besides the lookup column and the 
primary key. We should really rename them once they work, both to make some 
sense and to break with the historical baggage.

> I think this can be represented as a tombstone which can always be fetched 
> from the base table on read or maybe some other arrangement? I agree it can't 
> feasibly be represented as an enumeration of the deletions at least not 
> synchronously and doing it async has its own problems.

If the base table must be read on every read of an index/view, then I think 
this proposal is approximately linearizable for the view as well (though I do 
not at all warrant this statement). You still need to propagate the deletion 
eventually so that the views can clean up. This also makes each read take two 
round trips (2RT), which is rather costly.
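
As a rough illustration of that read path, the following Python sketch 
filters view results against the base table and counts round trips. It is a 
toy model under stated assumptions: read_via_view is a hypothetical function, 
the view and base table are plain dictionaries rather than Cassandra APIs, and 
each lookup stage is counted as one round trip.

```python
def read_via_view(view, base, value):
    """Return (live_rows, round_trips) for a lookup of `value` through the view.

    `view` maps an indexed value to a list of base primary keys; `base` maps a
    base primary key to its row, with deleted rows simply absent. Both are
    in-memory stand-ins, not Cassandra APIs.
    """
    round_trips = 0

    # Round trip 1: ask the view for candidate base keys.
    candidates = view.get(value, [])
    round_trips += 1

    # Round trip 2: confirm each candidate against the base table, dropping
    # entries whose base row was deleted or no longer holds `value`.
    live = []
    for key in candidates:
        row = base.get(key)
        if row is not None and row.get("v1") == value:
            live.append((key, row))
    round_trips += 1

    return live, round_trips


base = {("p1", "c1"): {"v1": "a"}, ("p3", "c1"): {"v1": "b"}}
# The view still lists ("p2", "c9"), whose base row has been deleted,
# and ("p3", "c1"), whose value has since changed from "a" to "b".
view = {"a": [("p1", "c1"), ("p2", "c9"), ("p3", "c1")]}

live, trips = read_via_view(view, base, "a")
```

The stale entries are hidden from the reader but remain in the view, which is 
why the deletion still has to be propagated eventually.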

> On 12 May 2025, at 16:10, Ariel Weisberg wrote:
> 
> Hi,
> 
> I think it's worth taking a step back and looking at the current MV 
> restrictions which are pretty onerous.
> 
> A view must have a primary key and that primary key must conform to the 
> following restrictions:
> - it must contain all the primary key columns of the base table. This 
>   ensures that every row of the view corresponds to exactly one row of the 
>   base table.
> - it can only contain a single column that is not a primary key column in 
>   the base table.
> 
> At that point what exactly is the value in including anything except the 
> original primary key in the MV's primary key columns unless you are using an 
> ordered partitioner so you can iterate based on the leading primary key 
> columns?
> 
> Like something doesn't add up here because if it always includes the base 
> table's primary key columns that means they could be storage attached by just 
> forbidding additional columns and there doesn't seem to be much utility in 
> including additional columns in the primary key?
> 
> I'm not that clear on how much better it is to look something up in the MV vs 
> just looking at the base table or some non-materialized view of it. How 
> exactly are these MVs supposed to be used and what value do they provide?
> 
> Jeff Jirsa wrote:
>> There’s 2 things in this proposal that give me a lot of pause.
> 
> Runtian Liu pointed out that the CEP is sort of divided into two parts. The 
> first is the online part which is making reads/writes to MVs safer and more 
> reliable using a transaction system. The second is offline which is repair.
> 
> The story for the online portion I think is quite strong and worth 
> considering on its own merits.
> 
> The offline portion (repair) sounds a little less feasible to run in 
> production, but I also think that MVs without any mechanism for checking 
> their consistency are not viable to run in production. So it's kind of pay 
> for what you use in terms of the feature?
> 
> It's definitely worth thinking through if there is a way to fix one side of 
> this equation so it works better.
> 
> David Capwell wrote:
>> As far as I can tell, being based off Accord means you don’t need to care 
>> about repair, as Accord will manage the consistency for you; you can’t get 
>> out of sync.
> I think a baseline requirement in C* for something to be in production is to 
> be able to run preview repair and validate that the transaction system or any 
> other part of Cassandra hasn't made a mistake. Divergence can have many 
> sources including Accord.
> 
> Runtian Liu wrote:
>> LWT cannot support the example David mentioned. Since LWTs operate on a 
>> single token, we’ll need to restrict base-table updates to one partition—and 
>> ideally one row—at a time. A current MV base-table command can delete an 
>> entire partition, but doing so might touch hundreds of MV partitions, making 
>> consistency guarantees impossible. 
> I think this can be represented as a tombstone which can always be fetched 
> from the base table on read or maybe some other arrangement? I agree it can't 
> feasibly be represented as an enumeration of the deletions at least not 
> synchronously and doing it async has its own problems.
> 
> Ariel
> 
> On Fri, May 9, 2025, at 4:03 PM, Jeff Jirsa wrote:
>> 
>> 
>>> On May 9, 2025, at 12:59 PM, Ariel Weisberg wrote:
>>> 
>>> 
>>> I am a *big* fan of getting repair really working with MVs. It does seem 
>>> problematic that the number of merkle trees will be equal to the number of 
>>> ranges in the cluster and repair of MVs would become an all node operation. 
>>> How would down nodes be handled and how many nodes would simultaneously be 
>>> working to validate a given base table range at once? How many base table 
>>> ranges could simultaneously be repairing MVs?
>>> 
>>> If a row containing a column that creates an MV partition is deleted, and 
>>> the MV isn't updated, then how does the merkle tree approach propagate the 
>>> deletion to the MV? The CEP says that anti-compaction would remove extra 
>>> rows, but I am not clear on how that works. When is anti-compaction 
>>> performed in the repair process and what is/isn't included in the outputs?
>> 
>> 
>> 
>> I thought about these two points last night after I sent my email.
>> 
>> There’s 2 things in this proposal that give me a lot of pause.
>> 
>> One is the lack of tombstones / deletions in the merkle trees, which makes 
>> properly dealing with writes/deletes/inconsistency very hard (afaict)
>> 
>> The second is the reality that repairing a single partition in the base 
>> table may repair all hosts/ranges in the MV table, and vice versa. Basically 
>> scanning either base or MV is effectively scanning the whole cluster (modulo 
>> what you can avoid in the clean/dirty repaired sets). This makes me really, 
>> really concerned with how it scales, and how likely it is to be able to 
>> schedule automatically without blowing up. 
>> 
>> The paxos vs accord comments so far are interesting in that I think both 
>> could be made to work, but I am very concerned about how the merkle tree 
>> comparisons are likely to work with wide partitions leading to massive 
>> fanout in ranges. 
>> 
>> 
>
