Yeah, this is exactly what I suggested in a different part of the thread.
The transformative repair could be done against the local index, and the
local index can repair against the global index. It opens up a lot of
possibilities, query-wise, as well.
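
Roughly the shape I have in mind, sketched below. The type and method names
are invented for illustration; none of this is in the CEP or in Cassandra
today.

// Illustration only: each stage compares two structures sorted by the same key.
interface RangeDigest {
    byte[] hashOfSlice(long startToken, long endToken);   // hash of rows in [start, end)
}

final class ChainedRepair {
    // Stage 1: base table vs. its local index. A mismatch here is fixed by
    // re-indexing the slice locally, with no cross-node streaming.
    static boolean localIndexMatches(RangeDigest base, RangeDigest localIndex,
                                     long start, long end) {
        return java.util.Arrays.equals(base.hashOfSlice(start, end),
                                       localIndex.hashOfSlice(start, end));
    }

    // Stage 2: local index vs. the global index (the view). Both sides are in
    // view-key order, so a mismatch maps directly to a view-key slice to re-stream.
    static boolean globalIndexMatches(RangeDigest localIndex, RangeDigest globalIndex,
                                      long start, long end) {
        return java.util.Arrays.equals(localIndex.hashOfSlice(start, end),
                                       globalIndex.hashOfSlice(start, end));
    }
}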



On Sat, May 17, 2025 at 1:47 PM Blake Eggleston <bl...@ultrablake.com>
wrote:

> > They are not two unordered sets, but rather two sets ordered by
> > different keys.
>
> I think you could exploit this to improve your MV repair design. Instead
> of taking global snapshots and persisting Merkle trees, you could implement
> a set of secondary indexes on the base and view tables whose contents you
> could quickly compare for repair.
>
> The indexes would have their contents sorted by base table partition
> order, but segmented by view partition ranges. Then any view <-> base
> repair would compare the intersecting index slices. That would allow you to
> repair data more quickly and with less operational complexity.
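>
> (Something like the following is the index layout I mean; the names are
> invented and nothing here is an existing Cassandra structure.)
>
> // Index entries are keyed by the view-range bucket first, then by base table
> // token, so one contiguous index slice covers exactly one (base range, view
> // range) cell. The base-side and view-side indexes share this key shape, so
> // repairing a cell is just comparing the two matching slices.
> record IndexEntry(int viewRangeBucket, long baseToken, byte[] rowHash)
>         implements Comparable<IndexEntry> {
>     public int compareTo(IndexEntry o) {
>         int c = Integer.compare(viewRangeBucket, o.viewRangeBucket);
>         return c != 0 ? c : Long.compare(baseToken, o.baseToken);
>     }
> }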
>
> On Fri, May 16, 2025, at 12:32 PM, Runtian Liu wrote:
>
>
> [chart omitted: a grid in which each cell is a Merkle tree covering one
> (base table range, MV range) pair]
>
> For example, in the chart above, each cell represents a Merkle tree that
> covers data belonging to a specific base table range and a specific MV
> range. When we scan a base table range, we can generate the Merkle trees
> marked in red. When we scan an MV range, we can generate the Merkle trees
> marked in green. The cells that can be compared are marked in blue.
>
> To save time and CPU resources, we persist the Merkle trees created during
> a scan so we don’t need to regenerate them later. This way, when other
> nodes scan and build Merkle trees based on the same “frozen” snapshot, we
> can reuse the existing Merkle trees for comparison.
>
> On Fri, May 16, 2025 at 12:22 PM Runtian Liu <curly...@gmail.com> wrote:
>
> Unfortunately, no. When building Merkle trees for small token ranges in
> the base table, those ranges may span the entire MV token range. As a
> result, we need to scan the entire MV to generate all the necessary Merkle
> trees. For efficiency, we perform this as a single pass over the entire
> table rather than scanning a small range of the base or MV table
> individually. As you mentioned, with storage becoming increasingly
> affordable, this approach helps us save time and CPU resources.
>
> On Fri, May 16, 2025 at 12:11 PM Jon Haddad <j...@rustyrazorblade.com>
> wrote:
>
> I spoke too soon - endless questions are not over :)
>
> Since the data that's going to be repaired only covers a range, I wonder
> if it makes sense to have the ability to issue a minimalist snapshot that
> only hardlinks SSTables that are in a token range.  Based on what you
> (Runtian) have said above, only a small percentage of the data would
> actually be repaired at any given time.
>
> Just a thought to save a little filesystem churn.
>
>
> On Fri, May 16, 2025 at 10:55 AM Jon Haddad <j...@rustyrazorblade.com>
> wrote:
>
> Never mind about the height thing, I guess it’s the same property.
>
> I’m done for now :)
>
> Thanks for entertaining my endless questions. My biggest concerns about
> repair have been alleviated.
>
> Jon
>
> On Fri, May 16, 2025 at 10:34 AM Jon Haddad <j...@rustyrazorblade.com>
> wrote:
>
> That’s the critical bit I was missing, thank you Blake.
>
> I guess we’d need to have unlimited-height trees then, since you’d need to
> be able to update the hashes of individual partitions, and we’d also need
> to propagate the hashes up every time as well. I’m curious what the cost
> will look like with that.
>
> At least it’s a CPU problem, not an I/O one.
>
> Jon
>
>
> On Fri, May 16, 2025 at 10:04 AM Blake Eggleston <bl...@ultrablake.com>
> wrote:
>
>
> The Merkle tree XORs the individual row hashes together, which is
> commutative. So you should be able to build a tree in view token order
> while reading in base table token order, and vice versa.
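>
> (Quick illustration of that point; this is toy code showing only the XOR
> math, not the actual MerkleTree implementation.)
>
> // Toy demo: XOR-combining 32-byte row hashes gives the same result in any
> // order, so a leaf hash is identical whether rows arrive in base table or
> // view token order.
> import java.util.*;
>
> public class XorOrderDemo {
>     static byte[] xor(byte[] acc, byte[] h) {
>         byte[] out = acc.clone();
>         for (int i = 0; i < out.length; i++) out[i] ^= h[i];
>         return out;
>     }
>
>     public static void main(String[] args) {
>         List<byte[]> rowHashes = new ArrayList<>();
>         Random rnd = new Random(42);
>         for (int i = 0; i < 5; i++) { byte[] h = new byte[32]; rnd.nextBytes(h); rowHashes.add(h); }
>
>         byte[] inBaseOrder = new byte[32];
>         for (byte[] h : rowHashes) inBaseOrder = xor(inBaseOrder, h);
>
>         List<byte[]> shuffled = new ArrayList<>(rowHashes);
>         Collections.shuffle(shuffled, rnd);        // pretend this is view token order
>         byte[] inViewOrder = new byte[32];
>         for (byte[] h : shuffled) inViewOrder = xor(inViewOrder, h);
>
>         System.out.println(Arrays.equals(inBaseOrder, inViewOrder));  // prints true
>     }
> }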
>
> On Fri, May 16, 2025, at 9:54 AM, Jon Haddad wrote:
>
> Thanks for the explanation, I appreciate it.  I think you might still be
> glossing over an important point - which I'll make on its own here.
> There are a number of things I'm concerned about, but this is a big one.
>
> Calculating the hash of a partition for a Merkle tree needs to be done on
> the fully materialized, sorted partition.
>
> The examples you're giving are simple, to the point where they hide the
> problem.  Here's a better example, where the MV has a clustering column. In
> the MV's partition it'll have multiple rows, but in the base table it'll be
> stored in different pages or different SSTables entirely:
>
> CREATE TABLE test.t1 (
>     id int PRIMARY KEY,
>     v1 int
> );
>
> CREATE MATERIALIZED VIEW test.test_mv AS
>     SELECT v1, id
>     FROM test.t1
>     WHERE id IS NOT NULL AND v1 IS NOT NULL
>     PRIMARY KEY (v1, id)
>  WITH CLUSTERING ORDER BY (id ASC);
>
>
> Let's say we have some test data:
>
> cqlsh:test> select id, v1 from t1;
>
>  id | v1
> ----+----
>  10 | 11
>   1 | 14
>  19 | 10
>   2 | 14
>   3 | 14
>
> When we transform the data by iterating over the base table, we get this
> representation (note v1=14):
>
> cqlsh:test> select v1, id from t1;
>
>  v1 | id
> ----+----
>  11 | 10
>  14 |  1   <------
>  10 | 19
>  14 |  2 <------
>  14 |  3  <------
>
>
> The partition key in the new table is v1.  If you simply iterate and
> transform and calculate Merkle trees on the fly, you'll hit v1=14 with
> id=1, but you'll miss id=2 and id=3.  You need to get them all up front,
> and in sorted order, before you calculate the hash.  You actually need to
> transform the data to this, prior to calculating the tree:
>
> v1 | id
> ----+----
>  11 | 10
>  14 |  1, 2, 3
>  10 | 19
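>
> (To make that concrete, here's the grouping sketched with plain ints. This
> is the structure that has to exist, either in memory or via an index, before
> each MV partition can be hashed.)
>
> // Group base-table rows by the MV partition key (v1), sorted by v1 and,
> // within each MV partition, by the clustering key (id), before hashing.
> import java.util.*;
>
> public class GroupByViewKey {
>     public static void main(String[] args) {
>         int[][] baseRows = { {10, 11}, {1, 14}, {19, 10}, {2, 14}, {3, 14} };  // (id, v1)
>
>         // TreeMap keeps MV partitions in v1 order; TreeSet sorts ids within each one.
>         NavigableMap<Integer, NavigableSet<Integer>> mvPartitions = new TreeMap<>();
>         for (int[] row : baseRows)
>             mvPartitions.computeIfAbsent(row[1], k -> new TreeSet<>()).add(row[0]);
>
>         // {10=[19], 11=[10], 14=[1, 2, 3]} -- only now can each partition be hashed whole.
>         System.out.println(mvPartitions);
>     }
> }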
>
> Without an index you need to do one of the following over a dataset that's
> hundreds of GB:
>
> * for each partition, scan the entire range for all the data, then sort
> that partition in memory, then calculate the hash
> * collect the entire dataset in memory, transform and sort it
> * use a local index which has the keys already sorted
>
> A similar problem exists when trying to resolve the mismatches.
>
> Unless I'm missing some critical detail, I can't see how this will work
> without requiring nodes to have hundreds of GB of RAM, or doing several
> orders of magnitude more I/O than a normal repair.
>
> Jon
>
>
>
> On Thu, May 15, 2025 at 9:09 PM Runtian Liu <curly...@gmail.com> wrote:
>
> Thank you for the thoughtful questions, Jon. I really appreciate them—let
> me go through them one by one.
> * Do you intend on building all the Merkle trees in parallel?
>
> Since we take a snapshot to "freeze" the dataset, we don’t need to build
> all Merkle trees in parallel.
>
>
> * Will there be hundreds of files doing random IO to persist the trees to
> disk, in addition to the sequential IO from repair?
>
> The Merkle tree will only be persisted after the entire range scan is
> complete.
>
>
> * Is the intention of persisting the trees to disk to recover from
> failure, or just to limit memory usage?
>
> This is primarily to limit memory usage. As you may have noticed, MV
> repair needs to coordinate across the entire cluster rather than just a few
> nodes. This process may take a very long time, and a node may restart or
> perform other operations during that period.
>
>
> * Have you calculated the Merkle tree space requirements?
> This is a very good question—I'll add it to the CEP as well. Each leaf
> node stores a 32-byte hash. With a tree depth of 15 (which is on the higher
> end—smaller datasets might use fewer than 10 levels), a single Merkle tree
> would be approximately 32 × 2¹⁵ bytes, or 1 MB. If we split the tokens into
> 10 ranges per node, we’ll end up with around 100 Merkle trees per node,
> totaling roughly 100 MB.
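>
> (The arithmetic spelled out, using the figures above; the 100 trees per
> node follows from the r-squared layout described further down.)
>
> // Space estimate: 32-byte leaf hashes, depth-15 tree, ~100 trees per node.
> public class TreeSpaceEstimate {
>     public static void main(String[] args) {
>         long bytesPerTree = 32L * (1L << 15);          // 1,048,576 bytes, about 1 MB
>         long bytesPerNode = bytesPerTree * 100;        // about 100 MB per node
>         System.out.println(bytesPerTree + " bytes per tree, "
>                 + bytesPerNode / (1 << 20) + " MB per node");
>     }
> }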
> * When do we build the Merkle trees for the view?  Is that happening in
> parallel with the base table?  Do we have the computational complexity of 2
> full cluster repairs running simultaneously, or does it take twice as long?
>
> As mentioned earlier, this can be done in parallel with the base table or
> after building the base table’s Merkle tree, since we’re using a snapshot
> to “freeze” the data.
>
> > I'm very curious to hear if anyone has run a full cluster repair
> recently on a non-trivial dataset.  Every cluster I work with only does
> subrange repair.  I can't even recall the last time I did a full repair on
> a large cluster.  I may never have, now that I think about it.  Every time
> I've done this in the past it's been plagued with issues, both in terms of
> performance and reliability.  Subrange repair works because it can make
> progress in 15-30 minute increments.
> When we run a full repair, we trigger subrange repair on one node, then
> proceed to the next subrange, and continue this way until the node's entire
> primary range is repaired. After that, we move on to the next node—correct?
> The complexity comparison between full repair and the proposed MV repair is
> meant to compare the cost of repairing the entire dataset, not just a
> subrange.
>
> For the example you mentioned, let me explain how it works using the
> schema—without needing to create an index to build the Merkle trees.
>
> Suppose we have a node that owns the token range 1–30, and we have a few
> records in the base table and its corresponding MV:
>
>    - Base table: (1, 1), (2, 11), (12, 1), (23, 1)
>    - MV: (1, 1), (1, 12), (1, 23), (2, 11)
>
> When we run a full repair, we divide the node’s range into subranges of
> size 10, giving r = 3 ranges in total.
>
> First, we repair the range (1–10). The records (1, 1) and (2, 11) fall
> into this range and are used to build the first Merkle tree, which is then
> compared with the corresponding tree from another replica.
>
> Next, we repair the range (11–20). Here, the record (12, 1) is used to
> build the second Merkle tree.
>
> Finally, we repair the range (21–30), using the record (23, 1) to build
> the third Merkle tree, which is again compared with a replica's version.
>
> In MV repair, we still use a subrange size of 10. The key difference is
> that each Merkle tree is responsible for data not just based on the base
> table's partition key, but also on the MV's partition key.
>
> For example, when scanning the base table over the range (1–10):
>
>    - In full repair, we generate one Merkle tree for that subrange.
>    - In MV repair, we generate *r* = 3 Merkle trees, one for each MV
>      partition key range.
>
>
> This means the record (1, 1) will go into the first tree because the MV
> partition key is 1, while (2, 11) will go into the second tree because its
> MV key is 11. The third tree will be empty because there is no record with a
> base table key in (1–10) and an MV key in (21–30).
>
> After scanning the base table range (1–10), we proceed to the next range,
> (11–20), and again generate 3 Merkle trees, followed by the last range.
> This is why the total number of Merkle trees is *r²*—in this case, 9
> trees need to be built for the entire table.
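>
> (A sketch of that bucketing with the toy keys above; this is only an
> illustration of the indexing rule, not proposed code.)
>
> // Each row lands in tree [i][j], where i is the base-key subrange and j is
> // the MV-key subrange. Keys are 1..30 and r = 3, so each subrange covers 10 keys.
> public class TreeBucketing {
>     static int bucket(int key) { return (key - 1) / 10; }   // 1-10 -> 0, 11-20 -> 1, 21-30 -> 2
>
>     public static void main(String[] args) {
>         int[][] rows = { {1, 1}, {2, 11}, {12, 1}, {23, 1} };   // (base key, MV key)
>         for (int[] row : rows)
>             System.out.printf("row (%d, %d) -> tree [%d][%d]%n",
>                               row[0], row[1], bucket(row[0]), bucket(row[1]));
>         // (1, 1) -> [0][0], (2, 11) -> [0][1], (12, 1) -> [1][0], (23, 1) -> [2][0]
>     }
> }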
>
> A similar idea applies when scanning the MV to build Merkle trees.
> Essentially, for MV repair, each Merkle tree represents two-dimensional
> data, unlike normal repair where it only represents one dimension. Each
> Merkle tree represents the data that maps to Range(x) in the base table and
> Range(y) in the MV.
>
>
> In full repair, tokens must be sorted when adding to the Merkle tree
> because the tree is built from the leaves—records are added sequentially
> from left to right.
> For MV repair, since the leaf nodes are sorted by the MV partition key, a
> base table row can be inserted into any leaf node. This means we must
> insert each hash starting from the root instead of directly at the leaf.
> As noted in the comparison table, this increases complexity:
>
>
>
>    - In full repair, Merkle tree building is *O(1)* per row—each hash is
>      added sequentially to the leaf nodes.
>    - In MV repair, each hash must be inserted from the root, making it
>      *O(d)* per row.
>
>
> Since *d* (the tree depth) is typically small—less than 20 and often
> smaller than in full repair—this added complexity isn’t a major concern in
> practice. The reason it is smaller than in full repair is that, in the above
> example, we use 3 trees to represent the same amount of data, while full
> repair uses 1 tree.
>
>
>
>
> Note that within each leaf node, the order in which hashes are added
> doesn’t matter. Cassandra repair currently enforces sorted input only to
> ensure that leaf nodes are built from left to right.
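>
> (A sketch of that per-row insert; this is an invented class rather than the
> existing MerkleTree code, just to show the O(d) walk from the root and the
> order-independent XOR at the leaf.)
>
> // Walk a fixed-depth binary tree from the root, choosing a child by MV token,
> // then XOR the 32-byte row hash into the leaf. Internal hashes would be
> // recomputed bottom-up once the scan finishes.
> public class RootDownMerkleTree {
>     final byte[][] leaves;     // leaf index -> accumulated 32-byte hash
>     final int depth;
>     final long lo, hi;         // MV token range covered by this tree: [lo, hi)
>
>     RootDownMerkleTree(int depth, long lo, long hi) {
>         this.depth = depth; this.lo = lo; this.hi = hi;
>         this.leaves = new byte[1 << depth][32];
>     }
>
>     void add(long mvToken, byte[] rowHash) {
>         long start = lo, end = hi;
>         int leaf = 0;
>         for (int level = 0; level < depth; level++) {    // O(d) walk from the root
>             long mid = start + (end - start) / 2;
>             if (mvToken < mid) { end = mid; leaf = leaf << 1; }
>             else               { start = mid; leaf = (leaf << 1) | 1; }
>         }
>         byte[] acc = leaves[leaf];                       // XOR in: add order doesn't matter
>         for (int i = 0; i < 32; i++) acc[i] ^= rowHash[i];
>     }
> }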
>
> > So let's say we find a mismatch in a hash.  That indicates that there's
> some range of data where we have an issue.  For some token range calculated
> from the v1 field, we have a mismatch, right?  What do we do with that
> information?
> With the above example in mind, when we identify a range mismatch, it
> means we’ve found that data within the base table primary key range (a–b)
> and MV primary key range (m–n) has inconsistencies. We only need to rebuild
> this specific data.
> This allows us to easily locate the base table node that owns range (a–b)
> and rebuild only the affected MV partition key range (m–n).
>
> * Will there be coordination between all nodes in the cluster to ensure
> you don't have to do multiple scans?
>
>
> Yes, coordination is important for this type of repair. With the proposed
> solution, we can detect mismatches between the base table and the MV by
> scanning data from each of them just once.
> However, this doesn't mean all nodes need to be healthy during the repair.
> You can think of all the Merkle trees as forming a 2D matrix—if one node is
> down, it corresponds to one row and one column being unavailable for
> comparison. The remaining cells can still be used for mismatch detection.
>
>
>
> Please don’t hesitate to let me know if anything is unclear or if you have
> any further questions or concerns—I’d be happy to discuss them.
>
>
> Thanks,
>
> Runtian
>
>
>
>
> On Thu, May 15, 2025 at 6:34 PM Jon Haddad <j...@rustyrazorblade.com>
> wrote:
>
> One last thing.  I'm pretty sure building the tree requires the keys to be
> added in token order:
> https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L173
>
> Which definitely introduces a bit of a problem, given that the tree would
> be constructed from the transformed v1, which is a value unpredictable
> enough to be considered random.
>
> The only way I can think of to address this would be to maintain a local
> index on v1.  See my previous email where I mentioned this.
>
> Base Table -> Local Index -> Global Index
>
> Still a really hard problem.
>
> Jon
>
>
>
> On Thu, May 15, 2025 at 6:12 PM Jon Haddad <j...@rustyrazorblade.com>
> wrote:
>
> There's a lot here that's still confusing to me.  Maybe you can help me
> understand it better?  Apologies in advance for the text wall :)
>
> I'll use this schema as an example:
>
> ---------
> CREATE TABLE test.t1 (
>     id int PRIMARY KEY,
>     v1 int
> );
>
> CREATE MATERIALIZED VIEW test_mv AS
>     SELECT v1, id FROM test.t1
>     WHERE id IS NOT NULL AND v1 IS NOT NULL
>     PRIMARY KEY (v1, id);
> ---------
>
> We've got (id, v1) in the base table and (v1, id) in the MV.
>
> During the repair, we snapshot, and construct a whole bunch of Merkle
> trees.  CEP-48 says they will be persisted to disk.
>
> * Do you intend on building all the Merkle trees in parallel?
> * Will there be hundreds of files doing random IO to persist the trees to
> disk, in addition to the sequential IO from repair?
> * Is the intention of persisting the trees to disk to recover from
> failure, or just to limit memory usage?
> * Have you calculated the Merkle tree space requirements?
> * When do we build the Merkle trees for the view?  Is that happening in
> parallel with the base table?  Do we have the computational complexity of 2
> full cluster repairs running simultaneously, or does it take twice as long?
>
> I'm very curious to hear if anyone has run a full cluster repair recently
> on a non-trivial dataset.  Every cluster I work with only does subrange
> repair.  I can't even recall the last time I did a full repair on a large
> cluster.  I may never have, now that I think about it.  Every time I've
> done this in the past it's been plagued with issues, both in terms of
> performance and reliability.  Subrange repair works because it can make
> progress in 15-30 minute increments.
>
> Anyways - moving on...
>
> You suggest we read the base table and construct the Merkle trees based on
> the transformed rows. Using my schema above, we take the v1 field and use
> token(v1) to build the tree.  Assuming that a value for v1 appears many
> times throughout the dataset across many partitions, how do you intend on
> calculating its hash?  If you look at Validator.rowHash [1] and
> Validator.add, you'll see it's taking an UnfilteredRowIterator for an
> entire partition and calculates the hash based on that.  Here's the comment:
>
>  /**
>      * Called (in order) for every row present in the CF.
>      * Hashes the row, and adds it to the tree being built.
>      *
>      * @param partition Partition to add hash
>      */
>     public void add(UnfilteredRowIterator partition)
>
> So it seems to me like you need to have the entire partition materialized
> in memory before adding to the tree.    Doing that per value v1 without an
> index is pretty much impossible - we'd have to scan the entire dataset once
> per partition to pull out all the matching v1 values, or you'd need to
> materialize the entire dataset into a local version of the MV for that
> range. I don't know how you could do this.  Do you have a workaround for
> this planned?  Maybe someone that knows the Merkle tree code better can
> chime in.
>
> Maybe there's something else here I'm not aware of - please let me know
> what I'm missing here if I am, it would be great to see this in the doc if
> you have a solution.
>
> For the sake of discussion, let's assume we've moved past this and we have
> our trees for hundreds of ranges built from the base table and for the MV,
> and now we move on to the comparison.
>
> In the doc at this point, we delete the snapshot because we have the tree
> structures and we compare Merkle trees.  Then we stream mismatched data.
>
> So let's say we find a mismatch in a hash.  That indicates that there's
> some range of data where we have an issue.  For some token range calculated
> from the v1 field, we have a mismatch, right?  What do we do with that
> information?
>
> * Do we tell the node that owned the base table - hey, stream the data
> from base where token(v1) is in range [X,Y) to me?
> * That means we have to scan through the base again for all rows where
> token(v1) in [X,Y) range, right?  Because without an index on the hashes of
> v1, we're doing a full table scan and hashing every v1 value to find out if
> it needs to be streamed back to the MV.
> * Are we doing this concurrently on all nodes?
> * Will there be coordination between all nodes in the cluster to ensure
> you don't have to do multiple scans?
>
> I realize there are a lot of questions here, but unfortunately I'm having a
> hard time seeing how we can work around some of the core assumptions around
> constructing Merkle trees and using them to resolve the differences in a
> way that matches up with what's in the doc.  I have quite a few more things
> to discuss, but I'll save them for a follow up once all these have been
> sorted out.
>
> Thanks in advance!
> Jon
>
> [1]
> https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L209
>
>
>
> On Thu, May 15, 2025 at 10:10 AM Runtian Liu <curly...@gmail.com> wrote:
>
> The previous table compared the complexity of full repair and MV repair
> when reconciling one dataset with another. In production, we typically use
> a replication factor of 3 in one datacenter. This means full repair
> involves 3n rows, while MV repair involves comparing 6n rows (base + MV).
> Below is an updated comparison table reflecting this scenario.
>
> n: number of rows to repair (Total rows in the table)
>
> d: depth of one Merkle tree for MV repair
>
> r: number of split ranges
>
> p: data compacted away
>
>
> This comparison focuses on the complexities of one round of full repair
> with a replication factor of 3 versus repairing a single MV based on one
> base table with replication factor 3.
>
> Extra disk used: full repair 0, MV repair O(2*p)
>   Since we take a snapshot at the beginning of the repair, any disk space
>   that would normally be freed by compaction will remain occupied until the
>   Merkle trees are successfully built and the snapshot is cleared.
>
> Data scan complexity: full repair O(3*n), MV repair O(6*n)
>   Full repair scans n rows from the primary and 2n from replicas. MV repair
>   scans 3n rows from the base table and 3n from the MV.
>
> Merkle tree building time complexity: full repair O(3*n), MV repair O(6*n*d)
>   In full repair, Merkle tree building is O(1) per row—each hash is added
>   sequentially to the leaf nodes. In MV repair, each hash is inserted from
>   the root, making it O(d) per row. Since d is typically small (less than 20
>   and often smaller than in full repair), this isn’t a major concern.
>
> Total Merkle tree count: full repair O(3*r), MV repair O(6*r^2)
>   MV repair needs to generate more, smaller Merkle trees, but this isn’t a
>   concern as they can be persisted to disk during the repair process.
>
> Merkle tree comparison complexity: full repair O(3*n), MV repair O(3*n)
>   Assuming one row maps to one leaf node, both repairs are equivalent.
>
> Stream time complexity:
>
