Unfortunately, no. Even when we build Merkle trees for a small token range of the base table, the corresponding view rows can fall anywhere in the MV token range, so we still need to scan the entire MV to generate all the necessary Merkle trees. For efficiency, we do this as a single pass over the whole table rather than scanning small ranges of the base or MV individually. As you mentioned, with storage becoming increasingly affordable, keeping the full snapshot for this single pass is a worthwhile trade that saves us time and CPU.
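To make the single-pass idea above more concrete, here is a rough sketch in Java. It is illustration only: the class and names (SinglePassScanSketch, ViewRow, rangeIndex, the XOR leaf accumulators and the literal hash values) are made up for this sketch and are not the CEP-48 implementation or Cassandra's Validator API. It shows why one sequential scan of the MV is enough to populate every (base range, MV range) Merkle tree cell at once: each view row carries the base partition key columns, so its base token can be recomputed on the fly and routed to the right tree in the r x r grid. The tokens follow the earlier example in this thread (node owns 1-30, subranges of size 10, r = 3).

import java.util.List;

public class SinglePassScanSketch
{
    // Rough sketch only: simplified stand-ins, not the actual repair code.
    static final int R = 3;       // split ranges per dimension (r)
    static final int LEAVES = 4;  // leaves per (baseRange, viewRange) tree

    // hypothetical stand-in for one scanned view row: both tokens plus its row hash
    record ViewRow(long baseToken, long viewToken, long rowHash) {}

    // ranges (1-10), (11-20), (21-30) from the example in the thread
    static int rangeIndex(long token) { return (int) ((token - 1) / 10); }
    static int leafIndex(long token)  { return Math.floorMod(token, LEAVES); }

    public static void main(String[] args)
    {
        // leaf accumulators indexed by [baseRange][viewRange][leaf]; leaves are plain XOR accumulators here
        long[][][] leafXor = new long[R][R][LEAVES];

        // stand-in for the single sequential scan of the view; values are (base token, view token, hash)
        List<ViewRow> viewScan = List.of(
            new ViewRow(1, 1, 0xA1L), new ViewRow(2, 11, 0xB2L),
            new ViewRow(12, 1, 0xC3L), new ViewRow(23, 1, 0xD4L));

        for (ViewRow row : viewScan)
        {
            int baseRange = rangeIndex(row.baseToken()); // picks the grid row
            int viewRange = rangeIndex(row.viewToken()); // picks the grid column
            int leaf = leafIndex(row.viewToken());       // leaf within that tree
            // XOR is commutative, so rows can arrive in any order for a given leaf
            leafXor[baseRange][viewRange][leaf] ^= row.rowHash();
        }
        // e.g. base record (2, 11): base range (1-10), MV range (11-20), i.e. the "second tree"
        System.out.println(Long.toHexString(leafXor[0][1][leafIndex(11)]));
    }
}

The same loop run over the base table in its own token order fills the matching grid on the base side, which is why neither side needs a per-partition sort or a local index before hashing.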
On Fri, May 16, 2025 at 12:11 PM Jon Haddad <j...@rustyrazorblade.com> wrote: > I spoke too soon - endless questions are not over :) > > Since the data that's going to be repaired only covers a range, I wonder > if it makes sense to have the ability to issue a minimalist snapshot that > only hardlinks SSTables that are in a token range. Based on what you > (Runtian) have said above, only a small percentage of the data would > actually be repaired at any given time. > > Just a thought to save a little filesystem churn. > > > On Fri, May 16, 2025 at 10:55 AM Jon Haddad <j...@rustyrazorblade.com> > wrote: > >> Nevermind about the height thing i guess its the same property. >> >> I’m done for now :) >> >> Thanks for entertaining my endless questions. My biggest concerns about >> repair have been alleviated. >> >> Jon >> >> On Fri, May 16, 2025 at 10:34 AM Jon Haddad <j...@rustyrazorblade.com> >> wrote: >> >>> Thats the critical bit i was missing, thank you Blake. >>> >>> I guess we’d need to have unlimited height trees then, since you’d need >>> to be able to update the hashes of individual partitions, and we’d also >>> need to propagate the hashes up every time as well. I’m curious what the >>> cost will look like with that. >>> >>> At least it’s a cpu problem not an I/O one. >>> >>> Jon >>> >>> >>> On Fri, May 16, 2025 at 10:04 AM Blake Eggleston <bl...@ultrablake.com> >>> wrote: >>> >>>> The merkle tree xor's the individual row hashes together, which is >>>> commutative. So you should be able to build a tree in the view token order >>>> while reading in base table token order and vise versa. >>>> >>>> On Fri, May 16, 2025, at 9:54 AM, Jon Haddad wrote: >>>> >>>> Thanks for the explanation, I appreciate it. I think you might still >>>> be glossing over an important point - which I'll make singularly here. >>>> There's a number of things I'm concerned about, but this is a big one. >>>> >>>> Calculating the hash of a partition for a Merkle tree needs to be done >>>> on the fully materialized, sorted partition. >>>> >>>> The examples you're giving are simple, to the point where they hide the >>>> problem. Here's a better example, where the MV has a clustering column. In >>>> the MV's partition it'll have multiple rows, but in the base table it'll be >>>> stored in different pages or different SSTables entirely: >>>> >>>> CREATE TABLE test.t1 ( >>>> id int PRIMARY KEY, >>>> v1 int >>>> ); >>>> >>>> CREATE MATERIALIZED VIEW test.test_mv AS >>>> SELECT v1, id >>>> FROM test.t1 >>>> WHERE id IS NOT NULL AND v1 IS NOT NULL >>>> PRIMARY KEY (v1, id) >>>> WITH CLUSTERING ORDER BY (id ASC); >>>> >>>> >>>> Let's say we have some test data: >>>> >>>> cqlsh:test> select id, v1 from t1; >>>> >>>> id | v1 >>>> ----+---- >>>> 10 | 11 >>>> 1 | 14 >>>> 19 | 10 >>>> 2 | 14 >>>> 3 | 14 >>>> >>>> When we transform the data by iterating over the base table, we get >>>> this representation (note v1=14): >>>> >>>> cqlsh:test> select v1, id from t1; >>>> >>>> v1 | id >>>> ----+---- >>>> 11 | 10 >>>> 14 | 1 <------ >>>> 10 | 19 >>>> 14 | 2 <------ >>>> 14 | 3 <------ >>>> >>>> >>>> The partiton key in the new table is v1. If you simply iterate and >>>> transform and calculate merkle trees on the fly, you'll hit v1=14 with >>>> id=1, but you'll miss id=2 and id=3. You need to get them all up front, >>>> and in sorted order, before you calculate the hash. 
You actually need to >>>> transform the data to this, prior to calculating the tree: >>>> >>>> v1 | id >>>> ----+---- >>>> 11 | 10 >>>> 14 | 1, 2, 3 >>>> 10 | 19 >>>> >>>> Without an index you need to do one of the following over a dataset >>>> that's hundreds of GB: >>>> >>>> * for each partition, scan the entire range for all the data, then sort >>>> that partition in memory, then calculate the hash >>>> * collect the entire dataset in memory, transform and sort it >>>> * use a local index which has the keys already sorted >>>> >>>> A similar problem exists when trying to resolve the mismatches. >>>> >>>> Unless I'm missing some critical detail, I can't see how this will work >>>> without requiring nodes have hundreds of GB of RAM or we do several orders >>>> of magnitude more I/O than a normal repair. >>>> >>>> Jon >>>> >>>> >>>> >>>> On Thu, May 15, 2025 at 9:09 PM Runtian Liu <curly...@gmail.com> wrote: >>>> >>>> Thank you for the thoughtful questions, Jon. I really appreciate >>>> them—let me go through them one by one. >>>> ** *Do you intend on building all the Merkle trees in parallel? >>>> >>>> Since we take a snapshot to "freeze" the dataset, we don’t need to >>>> build all Merkle trees in parallel. >>>> >>>> >>>> * Will there be hundreds of files doing random IO to persist the trees >>>> to disk, in addition to the sequential IO from repair? >>>> >>>> The Merkle tree will only be persisted after the entire range scan is >>>> complete. >>>> >>>> >>>> * Is the intention of persisting the trees to disk to recover from >>>> failure, or just to limit memory usage? >>>> >>>> This is primarily to limit memory usage. As you may have noticed, MV >>>> repair needs to coordinate across the entire cluster rather than just a few >>>> nodes. This process may take a very long time, and nodes may restart or >>>> perform other operations while it is running. >>>> >>>> >>>> ** *Have you calculated the Merkle tree space requirements? >>>> This is a very good question—I'll add it to the CEP as well. Each leaf >>>> node stores a 32-byte hash. With a tree depth of 15 (which is on the higher >>>> end—smaller datasets might use fewer than 10 levels), a single Merkle tree >>>> would be approximately 32 × 2¹⁵ bytes, or 1 MB. If we split the tokens into >>>> 10 ranges per node, we’ll end up with around 100 Merkle trees per node, >>>> totaling roughly 100 MB. >>>> * When do we build the Merkle trees for the view? Is that happening in >>>> parallel with the base table? Do we have the computational complexity of 2 >>>> full cluster repairs running simultaneously, or does it take twice as long? >>>> >>>> As mentioned earlier, this can be done in parallel with the base table >>>> or after building the base table’s Merkle tree, since we’re using a >>>> snapshot to “freeze” the data. >>>> >>>> > I'm very curious to hear if anyone has run a full cluster repair >>>> recently on a non-trivial dataset. Every cluster I work with only does >>>> subrange repair. I can't even recall the last time I did a full repair on >>>> a large cluster. I may never have, now that I think about it. Every time >>>> I've done this in the past it's been plagued with issues, both in terms of >>>> performance and reliability. Subrange repair works because it can make >>>> progress in 15-30 minute increments. >>>> When we run a full repair, we trigger subrange repair on one node, then >>>> proceed to the next subrange, and continue this way until the node's entire >>>> primary range is repaired. After that, we move on to the next node—correct?
>>>> The complexity comparison between full repair and the proposed MV repair is >>>> meant to compare the cost of repairing the entire dataset, not just a >>>> subrange. >>>> >>>> For the example you mentioned, let me explain how it works using the >>>> schema—without needing to create an index to build the Merkle trees. >>>> >>>> Suppose we have a node that owns the token range 1–30, and we have a >>>> few records in the base table and its corresponding MV: >>>> >>>> - >>>> >>>> Base table: (1, 1), (2, 11), (12, 1), (23, 1) >>>> - >>>> >>>> MV: (1, 1), (1, 12), (1, 23), (11, 2) >>>> >>>> When we run a full repair, we divide the node’s range into subranges of >>>> size 10, so we have r=3 ranges in total. >>>> >>>> First, we repair the range (1–10). The records (1, 1) and (2, 11) fall >>>> into this range and are used to build the first Merkle tree, which is then >>>> compared with the corresponding tree from another replica. >>>> >>>> Next, we repair the range (11–20). Here, the record (12, 1) is used to >>>> build the second Merkle tree. >>>> >>>> Finally, we repair the range (21–30), using the record (23, 1) to build >>>> the third Merkle tree, which is again compared with a replica's version. >>>> >>>> In MV repair, we still use a subrange size of 10. The key difference is >>>> that each Merkle tree is responsible for data not just based on the base >>>> table's partition key, but also on the MV's partition key. >>>> >>>> For example, when scanning the base table over the range (1–10): >>>> >>>> - >>>> >>>> In full repair, we generate one Merkle tree for that subrange. >>>> - >>>> >>>> In MV repair, we generate *r* = 3 Merkle trees, one for each MV >>>> partition key range. >>>> >>>> >>>> This means the record (1, 1) will go into the first tree because the MV >>>> partition key is 1, while (2, 11) will go into the second tree because its >>>> MV key is 11. The third tree will be empty because there is no record with >>>> base table key in (1–10) and MV key in (21–30). >>>> >>>> After scanning the base table range (1–10), we proceed to the next >>>> range, (11–20), and again generate 3 Merkle trees, followed by the last >>>> range. This is why the total number of Merkle trees is *r²*—in this >>>> case, 9 trees need to be built for the entire table. >>>> >>>> A similar idea applies when scanning the MV to build Merkle trees. >>>> Essentially, for MV repair, each Merkle tree represents two-dimensional >>>> data, unlike normal repair where it only represents one dimension. Each >>>> Merkle tree represents the data that maps to Range(x) in the base table and >>>> Range(y) in the MV. >>>> >>>> >>>> In full repair, tokens must be sorted when adding to the Merkle tree >>>> because the tree is built from the leaves—records are added sequentially >>>> from left to right. >>>> For MV repair, since the leaf nodes are sorted by the MV partition key, >>>> a base table row can be inserted into any leaf node. This means we must >>>> insert each hash starting from the root instead of directly at the leaf. >>>> As noted in the comparison table, this increases complexity: >>>> >>>> >>>> >>>> - >>>> >>>> In full repair, Merkle tree building is *O(1)* per row—each hash is >>>> added sequentially to the leaf nodes. >>>> - >>>> >>>> In MV repair, each hash must be inserted from the root, making it >>>> *O(d)* per row. >>>> >>>> >>>> Since *d* (the tree depth) is typically small—less than 20 and often >>>> smaller than in full repair—this added complexity isn’t a major concern in >>>> practice.
The reason it is smaller than full repair is that, with the above >>>> example, we use 3 trees to represent the same amount of data while full >>>> repair uses 1 tree. >>>> >>>> >>>> >>>> >>>> Note that within each leaf node, the order in which hashes are added >>>> doesn’t matter. Cassandra repair currently enforces sorted input only to >>>> ensure that leaf nodes are built from left to right. >>>> >>>> > So let's say we find a mismatch in a hash. That indicates that >>>> there's some range of data where we have an issue. For some token range >>>> calculated from the v1 field, we have a mismatch, right? What do we do >>>> with that information? >>>> With the above example being said, when we identify a range mismatch, >>>> it means we’ve found that data within the base table primary key range >>>> (a–b) and MV primary key range (m–n) has inconsistencies. We only need to >>>> rebuild this specific data. >>>> This allows us to easily locate the base table node that owns range >>>> (a–b) and rebuild only the affected MV partition key range (m–n). >>>> >>>> * Will there be coordination between all nodes in the cluster to ensure >>>> you don't have to do multiple scans? >>>> >>>> >>>> Yes, coordination is important for this type of repair. With the >>>> proposed solution, we can detect mismatches between the base table and the >>>> MV by scanning data from each of them just once. >>>> However, this doesn't mean all nodes need to be healthy during the >>>> repair. You can think of all the Merkle trees as forming a 2D matrix—if one >>>> node is down, it corresponds to one row and one column being unavailable >>>> for comparison. The remaining cells can still be used for mismatch >>>> detection. >>>> >>>> >>>> >>>> Please don’t hesitate to let me know if anything is unclear or if you >>>> have any further questions or concerns—I’d be happy to discuss them. >>>> >>>> >>>> Thanks, >>>> >>>> Runtian >>>> >>>> >>>> >>>> >>>> On Thu, May 15, 2025 at 6:34 PM Jon Haddad <j...@rustyrazorblade.com> >>>> wrote: >>>> >>>> One last thing. I'm pretty sure building the tree requires the keys be >>>> added in token order: >>>> https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L173 >>>> >>>> Which definitely introduces a bit of a problem, given that the tree >>>> would be constructed from the transformed v1, which is a value >>>> unpredictable enough to be considered random. >>>> >>>> The only way I can think of to address this would be to maintain a >>>> local index on v1. See my previous email where I mentioned this. >>>> >>>> Base Table -> Local Index -> Global Index >>>> >>>> Still a really hard problem. >>>> >>>> Jon >>>> >>>> >>>> >>>> On Thu, May 15, 2025 at 6:12 PM Jon Haddad <j...@rustyrazorblade.com> >>>> wrote: >>>> >>>> There's a lot here that's still confusing to me. Maybe you can help me >>>> understand it better? Apologies in advance for the text wall :) >>>> >>>> I'll use this schema as an example: >>>> >>>> --------- >>>> CREATE TABLE test.t1 ( >>>> id int PRIMARY KEY, >>>> v1 int >>>> ); >>>> >>>> create MATERIALIZED VIEW test_mv as >>>> SELECT v1, id from test.t1 where id is not null and v1 is not null >>>> primary key (v1, id); >>>> --------- >>>> >>>> We've got (id, v1) in the base table and (v1, id) in the MV. >>>> >>>> During the repair, we snapshot, and construct a whole bunch of merkle >>>> trees. CEP-48 says they will be persisted to disk. 
>>>> >>>> ** *Do you intend on building all the Merkle trees in parallel? >>>> * Will there be hundreds of files doing random IO to persist the trees >>>> to disk, in addition to the sequential IO from repair? >>>> * Is the intention of persisting the trees to disk to recover from >>>> failure, or just to limit memory usage? >>>> ** *Have you calculated the Merkle tree space requirements? >>>> * When do we build the Merkle trees for the view? Is that happening in >>>> parallel with the base table? Do we have the computational complexity of 2 >>>> full cluster repairs running simultaneously, or does it take twice as long? >>>> >>>> I'm very curious to hear if anyone has run a full cluster repair >>>> recently on a non-trivial dataset. Every cluster I work with only does >>>> subrange repair. I can't even recall the last time I did a full repair on >>>> a large cluster. I may never have, now that I think about it. Every time >>>> I've done this in the past it's been plagued with issues, both in terms of >>>> performance and reliability. Subrange repair works because it can make >>>> progress in 15-30 minute increments. >>>> >>>> Anyways - moving on... >>>> >>>> You suggest we read the base table and construct the Merkle trees based >>>> on the transformed rows. Using my schema above, we take the v1 field and >>>> use token(v1), to build the tree. Assuming that a value for v1 appears >>>> many times throughout the dataset across many partitions, how do you intend >>>> on calculating it's hash? If you look at Validator.rowHash [1] and >>>> Validator.add, you'll see it's taking an UnfilteredRowIterator for an >>>> entire partition and calculates the hash based on that. Here's the >>>> comment: >>>> >>>> /** >>>> * Called (in order) for every row present in the CF. >>>> * Hashes the row, and adds it to the tree being built. >>>> * >>>> * @param partition Partition to add hash >>>> */ >>>> public void add(UnfilteredRowIterator partition) >>>> >>>> So it seems to me like you need to have the entire partition >>>> materialized in memory before adding to the tree. Doing that per value >>>> v1 without an index is pretty much impossible - we'd have to scan the >>>> entire dataset once per partition to pull out all the matching v1 values, >>>> or you'd need to materialize the entire dataset into a local version of the >>>> MV for that range. I don't know how you could do this. Do you have a >>>> workaround for this planned? Maybe someone that knows the Merkle tree code >>>> better can chime in. >>>> >>>> Maybe there's something else here I'm not aware of - please let me know >>>> what I'm missing here if I am, it would be great to see this in the doc if >>>> you have a solution. >>>> >>>> For the sake of discussion, let's assume we've moved past this and we >>>> have our tree for a hundreds of ranges built from the base table & built >>>> for the MV, now we move onto the comparison. >>>> >>>> In the doc at this point, we delete the snapshot because we have the >>>> tree structures and we compare Merkle trees. Then we stream mismatched >>>> data. >>>> >>>> So let's say we find a mismatch in a hash. That indicates that there's >>>> some range of data where we have an issue. For some token range calculated >>>> from the v1 field, we have a mismatch, right? What do we do with that >>>> information? >>>> >>>> * Do we tell the node that owned the base table - hey, stream the data >>>> from base where token(v1) is in range [X,Y) to me? 
>>>> * That means we have to scan through the base again for all rows where >>>> token(v1) in [X,Y) range, right? Because without an index on the hashes of >>>> v1, we're doing a full table scan and hashing every v1 value to find out if >>>> it needs to be streamed back to the MV. >>>> * Are we doing this concurrently on all nodes? >>>> * Will there be coordination between all nodes in the cluster to ensure >>>> you don't have to do multiple scans? >>>> >>>> I realized there's a lot of questions here, but unfortunately I'm >>>> having a hard time seeing how we can workaround some of the core >>>> assumptions around constructing Merkle trees and using them to resolve the >>>> differences in a way that matches up with what's in the doc. I have quite >>>> a few more things to discuss, but I'll save them for a follow up once all >>>> these have been sorted out. >>>> >>>> Thanks in advance! >>>> Jon >>>> >>>> [1] >>>> https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L209 >>>> >>>> >>>> >>>> On Thu, May 15, 2025 at 10:10 AM Runtian Liu <curly...@gmail.com> >>>> wrote: >>>> >>>> The previous table compared the complexity of full repair and MV repair >>>> when reconciling one dataset with another. In production, we typically use >>>> a replication factor of 3 in one datacenter. This means full repair >>>> involves 3n rows, while MV repair involves comparing 6n rows (base + MV). >>>> Below is an updated comparison table reflecting this scenario. >>>> >>>> n: number of rows to repair (Total rows in the table) >>>> >>>> d: depth of one Merkle tree for MV repair >>>> >>>> r: number of split ranges >>>> >>>> p: data compacted away >>>> >>>> >>>> This comparison focuses on the complexities of one round of full repair >>>> with a replication factor of 3 versus repairing a single MV based on one >>>> base table with replication factor 3. >>>> >>>> *Full Repair* >>>> >>>> *MV Repair* >>>> >>>> *Comment* >>>> >>>> Extra disk used >>>> >>>> 0 >>>> >>>> O(2*p) >>>> >>>> Since we take a snapshot at the beginning of the repair, any disk space >>>> that would normally be freed by compaction will remain occupied until the >>>> Merkle trees are successfully built and the snapshot is cleared. >>>> >>>> Data scan complexity >>>> >>>> O(3*n) >>>> >>>> O(6*n) >>>> >>>> Full repair scans *n* rows from the primary and 2*n* from replicas.3 >>>> >>>> MV repair scans 3n rows from the base table and 3n from the MV. >>>> >>>> Merkle Tree building time complexity >>>> >>>> O(3n) >>>> >>>> O(6*n*d) >>>> >>>> In full repair, Merkle tree building is *O(1)* per row—each hash is >>>> added sequentially to the leaf nodes. >>>> >>>> In MV repair, each hash is inserted from the root, making it *O(d)* >>>> per row. Since *d* is typically small (less than 20 and often smaller >>>> than in full repair), this isn’t a major concern. >>>> >>>> Total Merkle tree count >>>> >>>> O(3*r) >>>> >>>> O(6*r^2) >>>> >>>> MV repair needs to generate more, smaller Merkle trees, but this isn’t >>>> a concern as they can be persisted to disk during the repair process. >>>> >>>> Merkle tree comparison complexity >>>> >>>> O(3n) >>>> >>>> O(3n) >>>> >>>> Assuming one row maps to one leaf node, both repairs are equivalent. >>>> >>>> Stream time complexity >>>> >>>> O(3n) >>>> >>>> O(3n) >>>> >>>> Assuming all rows need to be streamed, both repairs are equivalent. 
>>>> >>>> *In short:* Even for production use cases having RF=3 in one data >>>> center, we can see that the MV repair consumes temporary disk space and a >>>> small, usually negligible amount of extra CPU for tree construction; other >>>> costs match full repair. >>>> >>>> Additionally, with the online path proposed in this CEP, we expect >>>> mismatches to be rare, which can lower the frequency of running this repair >>>> process compared to full repair. >>>> >>>> >>>> On Thu, May 15, 2025 at 9:53 AM Jon Haddad <j...@rustyrazorblade.com> >>>> wrote: >>>> >>>> > They are not two unordered sets, but rather two sets ordered by >>>> different keys. >>>> >>>> I think this is a distinction without a difference. Merkle tree repair >>>> works because the ordering of the data is mostly the same across nodes. >>>> >>>> >>>> On Thu, May 15, 2025 at 9:27 AM Runtian Liu <curly...@gmail.com> wrote: >>>> >>>> > what we're trying to achieve here is comparing two massive unordered >>>> sets. >>>> >>>> They are not two unordered sets, but rather two sets ordered by >>>> different keys. This means that when building Merkle trees for the base >>>> table and the materialized view (MV), we need to use different strategies >>>> to ensure the trees can be meaningfully compared. >>>> >>>> To address scalability concerns for MV repair, I’ve included a >>>> comparison between one round of full repair and MV repair in the table >>>> below. This comparison is also added to the CEP. >>>> >>>> n: number of rows to repair (Total rows in the table) >>>> >>>> d: depth of one Merkle tree for MV repair >>>> >>>> r: number of split ranges >>>> >>>> p: data compacted away >>>> >>>> >>>> This comparison focuses on the complexities of one round of full repair >>>> with a replication factor of 2 versus repairing a single MV based on one >>>> base table replica. >>>> >>>> *Full Repair* >>>> >>>> *MV Repair* >>>> >>>> *Comment* >>>> >>>> Extra disk used >>>> >>>> 0 >>>> >>>> O(2*p) >>>> >>>> Since we take a snapshot at the beginning of the repair, any disk space >>>> that would normally be freed by compaction will remain occupied until the >>>> Merkle trees are successfully built and the snapshot is cleared. >>>> >>>> Data scan complexity >>>> >>>> O(2*n) >>>> >>>> O(2*n) >>>> >>>> Full repair scans *n* rows from the primary and *n* from replicas. >>>> >>>> MV repair scans *n* rows from the base table primary replica only, and >>>> *n* from the MV primary replica only. >>>> >>>> Merkle Tree building time complexity >>>> >>>> O(n) >>>> >>>> O(n*d) >>>> >>>> In full repair, Merkle tree building is *O(1)* per row—each hash is >>>> added sequentially to the leaf nodes. >>>> >>>> In MV repair, each hash is inserted from the root, making it *O(d)* >>>> per row. Since *d* is typically small (less than 20 and often smaller >>>> than in full repair), this isn’t a major concern. >>>> >>>> Total Merkle tree count >>>> >>>> O(2*r) >>>> >>>> O(2*r^2) >>>> >>>> MV repair needs to generate more, smaller Merkle trees, but this isn’t >>>> a concern as they can be persisted to disk during the repair process. >>>> >>>> Merkle tree comparison complexity >>>> >>>> O(n) >>>> >>>> O(n) >>>> >>>> Assuming one row maps to one leaf node, both repairs are equivalent. >>>> >>>> Stream time complexity >>>> >>>> O(n) >>>> >>>> O(n) >>>> >>>> Assuming all rows need to be streamed, both repairs are equivalent. 
>>>> >>>> *In short:* MV repair consumes temporary disk space and a small, >>>> usually negligible amount of extra CPU for tree construction; other costs >>>> match full repair. >>>> >>>> The core idea behind the proposed MV repair is as follows: >>>> >>>> 1. >>>> >>>> Take a snapshot to “freeze” the current state of both the base >>>> table and its MV. >>>> 2. >>>> >>>> Gradually scan the data from both tables to build Merkle trees. >>>> 3. >>>> >>>> Identify the token ranges where inconsistencies exist. >>>> 4. >>>> >>>> Rebuild only the mismatched ranges rather than the entire MV. >>>> >>>> With transaction-backed MVs, step 4 should rarely be necessary. >>>> >>>> On Thu, May 15, 2025 at 7:54 AM Josh McKenzie <jmcken...@apache.org> >>>> wrote: >>>> >>>> >>>> I think in order to address this, the view should be propagated to the >>>> base replicas *after* it's accepted by all or a majority of base replicas. >>>> This is where I think mutation tracking could probably help. >>>> >>>> Yeah, the idea of "don't reflect in the MV until you hit the CL the >>>> user requested for the base table". Introduces disjoint risk if you have >>>> coordinator death mid-write where replicas got base-data but that 2nd step >>>> didn't take place; think that's why Runtien et. al are looking at paxos >>>> repair picking up those pieces for you after the fact to get you back into >>>> consistency. Mutation tracking and Accord both have similar guarantees in >>>> this space. >>>> >>>> I think this would ensure that as long as there's no data loss or >>>> bit-rot, the base and view can be repaired independently. When there is >>>> data loss or bit-rot in either the base table or the view, then it is the >>>> same as 2i today: rebuild is required. >>>> >>>> And the repair as proposed in the CEP should resolve the bitrot and bug >>>> dataloss case I think. Certainly has much higher time complexity but the >>>> bounding of memory complexity to be comparable with regular repair doesn't >>>> strike me as a dealbreaker. >>>> >>>> On Thu, May 15, 2025, at 10:24 AM, Paulo Motta wrote: >>>> >>>> > I think requiring a rebuild is a deal breaker for most teams. In >>>> most instances it would be having to also expand the cluster to handle the >>>> additional disk requirements. It turns an inconsistency problem into a >>>> major operational headache that can take weeks to resolve. >>>> >>>> Agreed. The rebuild would not be required during normal operations when >>>> the cluster is properly maintained (ie. regular repair) - only in >>>> catastrophic situations. This is also the case for ordinary tables >>>> currently: if there's data loss, then restoring from a backup is needed. >>>> This could be a possible alternative to not require a rebuild in this >>>> extraordinary scenario. >>>> >>>> On Thu, May 15, 2025 at 10:14 AM Jon Haddad <j...@rustyrazorblade.com> >>>> wrote: >>>> >>>> I think requiring a rebuild is a deal breaker for most teams. In most >>>> instances it would be having to also expand the cluster to handle the >>>> additional disk requirements. It turns an inconsistency problem into a >>>> major operational headache that can take weeks to resolve. >>>> >>>> >>>> >>>> >>>> >>>> On Thu, May 15, 2025 at 7:02 AM Paulo Motta <pauloricard...@gmail.com> >>>> wrote: >>>> >>>> > There's bi-directional entropy issues with MV's - either orphaned >>>> view data or missing view data; that's why you kind of need a >>>> "bi-directional ETL" to make sure the 2 agree with each other. 
While normal >>>> repair would resolve the "missing data in MV" case, it wouldn't resolve the >>>> "data in MV that's not in base table anymore" case, which afaict all base >>>> consistency approaches (status quo, PaxosV2, Accord, Mutation Tracking) are >>>> vulnerable to. >>>> >>>> I don't think that bi-directional reconciliation should be a >>>> requirement, when the base table is assumed to be the source of truth as >>>> stated in the CEP doc. >>>> >>>> I think the main issue with the current MV implementation is that each >>>> view replica is independently replicated by the base replica, before the >>>> base write is acknowledged. >>>> >>>> This creates a correctness issue in the write path, because a view >>>> update can be created for a write that was not accepted by the coordinator >>>> in the following scenario: >>>> >>>> N=RF=3 >>>> CL=ONE >>>> - Update U is propagated to view replica V, coordinator that is also >>>> base replica B dies before accepting base table write request to client. >>>> Now U exists in V but not in B. >>>> >>>> I think in order to address this, the view should be propagated to the >>>> base replicas *after* it's accepted by all or a majority of base replicas. >>>> This is where I think mutation tracking could probably help. >>>> >>>> I think this would ensure that as long as there's no data loss or >>>> bit-rot, the base and view can be repaired independently. When there is >>>> data loss or bit-rot in either the base table or the view, then it is the >>>> same as 2i today: rebuild is required. >>>> >>>> > It'd be correct (if operationally disappointing) to be able to just >>>> say "if you have data loss in your base table you need to rebuild the >>>> corresponding MV's", but the problem is operators aren't always going to >>>> know when that data loss occurs. Not everything is as visible as a lost >>>> quorum of replicas or blown up SSTables. >>>> >>>> I think there are opportunities to improve rebuild speed, assuming the >>>> base table as a source of truth. For example, rebuild only subranges when >>>> data-loss is detected. >>>> >>>> On Thu, May 15, 2025 at 8:07 AM Josh McKenzie <jmcken...@apache.org> >>>> wrote: >>>> >>>> >>>> There's bi-directional entropy issues with MV's - either orphaned view >>>> data or missing view data; that's why you kind of need a "bi-directional >>>> ETL" to make sure the 2 agree with each other. While normal repair would >>>> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV >>>> that's not in base table anymore" case, which afaict all base consistency >>>> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable >>>> to. >>>> >>>> It'd be correct (if operationally disappointing) to be able to just say >>>> "if you have data loss in your base table you need to rebuild the >>>> corresponding MV's", but the problem is operators aren't always going to >>>> know when that data loss occurs. Not everything is as visible as a lost >>>> quorum of replicas or blown up SSTables. >>>> >>>> On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote: >>>> >>>> Maybe, I’m not really familiar enough with how “classic” MV repair >>>> works to say. You can’t mix normal repair and mutation reconciliation in >>>> the current incarnation of mutation tracking though, so I wouldn’t assume >>>> it would work with MVs. 
>>>> >>>> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote: >>>> >>>> In the case of bitrot / losing an SSTable, wouldn't a normal repair >>>> (just the MV against the other nodes) resolve the issue? >>>> >>>> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston <bl...@ultrablake.com> >>>> wrote: >>>> >>>> >>>> Mutation tracking is definitely an approach you could take for MVs. >>>> Mutation reconciliation could be extended to ensure all changes have been >>>> replicated to the views. When a base table received a mutation w/ an id it >>>> would generate a view update. If you block marking a given mutation id as >>>> reconciled until it’s been fully replicated to the base table and its view >>>> updates have been fully replicated to the views, then all view updates will >>>> eventually be applied as part of the log reconciliation process. >>>> >>>> A mutation tracking implementation would also allow you to be more >>>> flexible with the types of consistency levels you can work with, allowing >>>> users to do things like use LOCAL_QUORUM without leaving themselves open to >>>> introducing view inconsistencies. >>>> >>>> That would more or less eliminate the need for any MV repair in normal >>>> usage, but wouldn't address how to repair issues caused by bugs or data >>>> loss, though you may be able to do something with comparing the latest >>>> mutation ids for the base tables and its view ranges. >>>> >>>> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote: >>>> >>>> I don't see mutation tracking [1] mentioned in this thread or in the >>>> CEP-48 description. Not sure this would fit into the scope of this >>>> initial CEP, but I have a feeling that mutation tracking could be >>>> potentially helpful to reconcile base tables and views ? >>>> >>>> For example, when both base and view updates are acknowledged then this >>>> could be somehow persisted in the view sstables mutation tracking >>>> summary[2] or similar metadata ? Then these updates would be skipped during >>>> view repair, considerably reducing the amount of work needed, since only >>>> un-acknowledged views updates would need to be reconciled. >>>> >>>> [1] - >>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking| >>>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking%7C> >>>> [2] - <https://issues.apache.org/jira/browse/CASSANDRA-20336> >>>> >>>>
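P.S. To make the point Blake raised above about commutative leaf hashes concrete, here is a second rough sketch in Java (again with made-up stand-in types: RowRef, buildLeaves and the literal token/hash values are illustrative only, not Cassandra's actual Validator code). It feeds the same rows once in base-token order and once in view-token order, using the data from Jon's t1/test_mv example, and shows that the resulting leaf hashes come out identical, which is why no per-partition sort is needed before hashing even though the v1=14 rows are scattered across the base table.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CommutativeLeafSketch
{
    // Rough sketch only: leaves are modelled as plain XOR accumulators.
    record RowRef(long baseToken, long viewToken, long rowHash) {}

    static long[] buildLeaves(List<RowRef> rows, int leaves)
    {
        long[] acc = new long[leaves];
        for (RowRef r : rows)
            acc[Math.floorMod(r.viewToken(), leaves)] ^= r.rowHash(); // leaf chosen by view token
        return acc;
    }

    public static void main(String[] args)
    {
        // (id, v1) rows from the example: (10,11), (1,14), (19,10), (2,14), (3,14)
        List<RowRef> rows = List.of(
            new RowRef(10, 11, 0x11L), new RowRef(1, 14, 0x22L), new RowRef(19, 10, 0x33L),
            new RowRef(2, 14, 0x44L), new RowRef(3, 14, 0x55L));

        List<RowRef> baseOrder = new ArrayList<>(rows);
        baseOrder.sort(Comparator.comparingLong(RowRef::baseToken)); // order the base table is scanned in
        List<RowRef> viewOrder = new ArrayList<>(rows);
        viewOrder.sort(Comparator.comparingLong(RowRef::viewToken)); // order the view is scanned in

        // identical leaves either way, since XOR into a leaf is order-independent
        System.out.println(Arrays.equals(buildLeaves(baseOrder, 4), buildLeaves(viewOrder, 4)));
    }
}

The leaf-level XOR is the only place where row ordering could have mattered, so scanning the base table in its native token order is enough to build trees keyed by the view's partition key.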