For each row, calculating its hash first requires merging all the SSTables
that contain pieces of that row. We therefore cannot attach a Merkle tree
directly to each SSTable: merging per-SSTable trees would produce different
hash values for the same logical data whenever the compaction states of two
replicas differ.

On Sat, May 17, 2025 at 5:48 PM Jon Haddad <j...@rustyrazorblade.com> wrote:

> Could we do that for regular repair as well? That would make validation
> possible with barely any IO.
>
> Sstable attached merkle trees?
>
>
>
>
> On Sat, May 17, 2025 at 5:36 PM Jon Haddad <j...@rustyrazorblade.com>
> wrote:
>
>> What if you built the merkle tree for each sstable as a storage attached
>> index?
>>
>> Then your repair is merging merkle trees.
>>
>>
>> On Sat, May 17, 2025 at 4:57 PM Runtian Liu <curly...@gmail.com> wrote:
>>
>>> > I think you could exploit this to improve your MV repair design.
>>> Instead of taking global snapshots and persisting merkle trees, you could
>>> implement a set of secondary indexes on the base and view tables that you
>>> could quickly compare the contents of for repair.
>>>
>>> We actually considered this approach while designing the MV repair.
>>> However, there are several downsides:
>>>
>>>    1. It requires additional storage for the index files.
>>>    2. Data scans during repair would become random disk accesses instead
>>>       of sequential ones, which can degrade performance.
>>>    3. Most importantly, I decided against this approach due to the
>>>       complexity of ensuring index consistency. Introducing secondary
>>>       indexes opens up new challenges, such as keeping them in sync with
>>>       the actual data.
>>>
>>> The goal of the design is to provide a catch-all mismatch detection
>>> mechanism that targets the dataset users query during the online path. I
>>> did consider adding indexes at the SSTable level to guarantee consistency
>>> between indexes and data.
>>> > sorted by base table partition order, but segmented by view partition
>>> ranges
>>> If the indexes live at the SSTable level, the approach becomes less
>>> flexible: we would need to rewrite the SSTables whenever we decide to
>>> change the view partition ranges.
>>> I didn’t explore this direction further due to the issues listed above.
>>>
>>> > The transformative repair could be done against the local index, and
>>> the local index can repair against the global index. It opens up a lot of
>>> possibilities, query wise, as well.
>>> This is something I’m not entirely sure about—how exactly do we use the
>>> local index to support the global index (i.e., the MV)? If the MV relies on
>>> local indexes during the query path, we can definitely dig deeper into how
>>> repair could work with that design.
>>>
>>> The proposed design in this CEP aims to treat the base table and its MV
>>> like any other regular tables, so that operations such as compaction and
>>> repair can be handled in the same way in most cases.
>>>
>>> On Sat, May 17, 2025 at 2:42 PM Jon Haddad <j...@rustyrazorblade.com>
>>> wrote:
>>>
>>>> Yeah, this is exactly what i suggested in a different part of the
>>>> thread. The transformative repair could be done against the local index,
>>>> and the local index can repair against the global index. It opens up a lot
>>>> of possibilities, query wise, as well.
>>>>
>>>>
>>>>
>>>> On Sat, May 17, 2025 at 1:47 PM Blake Eggleston <bl...@ultrablake.com>
>>>> wrote:
>>>>
>>>>> > They are not two unordered sets, but rather two sets ordered by
>>>>> different keys.
>>>>>
>>>>> I think you could exploit this to improve your MV repair design.
>>>>> Instead of taking global snapshots and persisting merkle trees, you could
>>>>> implement a set of secondary indexes on the base and view tables that you
>>>>> could quickly compare the contents of for repair.
>>>>>
>>>>> The indexes would have their contents sorted by base table partition
>>>>> order, but segmented by view partition ranges. Then any view <-> base
>>>>> repair would compare the intersecting index slices. That would allow you 
>>>>> to
>>>>> repair data more quickly and with less operational complexity.
>>>>>
>>>>> On Fri, May 16, 2025, at 12:32 PM, Runtian Liu wrote:
>>>>>
>>>>> [chart: grid of Merkle trees, one cell per (base table range, MV range)
>>>>> pair; red cells are built while scanning a base table range, green cells
>>>>> while scanning an MV range, and blue cells are the ones that can be
>>>>> compared]
>>>>>
>>>>> For example, in the chart above, each cell represents a Merkle tree
>>>>> that covers data belonging to a specific base table range and a specific 
>>>>> MV
>>>>> range. When we scan a base table range, we can generate the Merkle trees
>>>>> marked in red. When we scan an MV range, we can generate the Merkle trees
>>>>> marked in green. The cells that can be compared are marked in blue.
>>>>>
>>>>> To save time and CPU resources, we persist the Merkle trees created
>>>>> during a scan so we don’t need to regenerate them later. This way, when
>>>>> other nodes scan and build Merkle trees based on the same “frozen”
>>>>> snapshot, we can reuse the existing Merkle trees for comparison.
>>>>>
>>>>> On Fri, May 16, 2025 at 12:22 PM Runtian Liu <curly...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Unfortunately, no. When building Merkle trees for small token ranges
>>>>> in the base table, those ranges may span the entire MV token range. As a
>>>>> result, we need to scan the entire MV to generate all the necessary Merkle
>>>>> trees. For efficiency, we perform this as a single pass over the entire
>>>>> table rather than scanning a small range of the base or MV table
>>>>> individually. As you mentioned, with storage becoming increasingly
>>>>> affordable, this approach helps us save time and CPU resources.
>>>>>
>>>>> On Fri, May 16, 2025 at 12:11 PM Jon Haddad <j...@rustyrazorblade.com>
>>>>> wrote:
>>>>>
>>>>> I spoke too soon - endless questions are not over :)
>>>>>
>>>>> Since the data that's going to be repaired only covers a range, I
>>>>> wonder if it makes sense to have the ability to issue a minimalist 
>>>>> snapshot
>>>>> that only hardlinks SSTables that are in a token range.  Based on what you
>>>>> (Runtian) have said above, only a small percentage of the data would
>>>>> actually be repaired at any given time.
>>>>>
>>>>> Just a thought to save a little filesystem churn.
>>>>>
>>>>>
>>>>> On Fri, May 16, 2025 at 10:55 AM Jon Haddad <j...@rustyrazorblade.com>
>>>>> wrote:
>>>>>
>>>>> Nevermind about the height thing i guess its the same property.
>>>>>
>>>>> I’m done for now :)
>>>>>
>>>>> Thanks for entertaining my endless questions. My biggest concerns
>>>>> about repair have been alleviated.
>>>>>
>>>>> Jon
>>>>>
>>>>> On Fri, May 16, 2025 at 10:34 AM Jon Haddad <j...@rustyrazorblade.com>
>>>>> wrote:
>>>>>
>>>>> Thats the critical bit i was missing, thank you Blake.
>>>>>
>>>>> I guess we’d need to have unlimited height trees then, since you’d
>>>>> need to be able to update the hashes of individual partitions, and we’d
>>>>> also need to propagate the hashes up every time as well. I’m curious what
>>>>> the cost will look like with that.
>>>>>
>>>>> At least it’s a cpu problem not an I/O one.
>>>>>
>>>>> Jon
>>>>>
>>>>>
>>>>> On Fri, May 16, 2025 at 10:04 AM Blake Eggleston <bl...@ultrablake.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>> The merkle tree XORs the individual row hashes together, which is
>>>>> commutative. So you should be able to build a tree in view token order
>>>>> while reading in base table token order, and vice versa.
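>>>>>
>>>>> A tiny sketch of that property (toy code, not Cassandra's Validator):
>>>>> because row hashes are combined with XOR, the leaf value does not depend
>>>>> on the order in which rows are visited.
>>>>>
>>>>> import java.util.*;
>>>>>
>>>>> public class XorOrderSketch
>>>>> {
>>>>>     // XOR is commutative and associative, so the accumulated leaf hash
>>>>>     // is independent of scan order.
>>>>>     static long leafHash(List<Long> rowHashes)
>>>>>     {
>>>>>         long acc = 0L;
>>>>>         for (long h : rowHashes)
>>>>>             acc ^= h;
>>>>>         return acc;
>>>>>     }
>>>>>
>>>>>     public static void main(String[] args)
>>>>>     {
>>>>>         List<Long> baseOrder = new ArrayList<>(List.of(11L, 14L, 10L));
>>>>>         List<Long> viewOrder = new ArrayList<>(baseOrder);
>>>>>         Collections.shuffle(viewOrder);            // e.g. view token order
>>>>>         System.out.println(leafHash(baseOrder) == leafHash(viewOrder)); // true
>>>>>     }
>>>>> }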
>>>>>
>>>>> On Fri, May 16, 2025, at 9:54 AM, Jon Haddad wrote:
>>>>>
>>>>> Thanks for the explanation, I appreciate it.  I think you might still
>>>>> be glossing over an important point - which I'll make singularly here.
>>>>> There's a number of things I'm concerned about, but this is a big one.
>>>>>
>>>>> Calculating the hash of a partition for a Merkle tree needs to be done
>>>>> on the fully materialized, sorted partition.
>>>>>
>>>>> The examples you're giving are simple, to the point where they hide
>>>>> the problem.  Here's a better example, where the MV has a clustering
>>>>> column. In the MV's partition it'll have multiple rows, but in the base
>>>>> table it'll be stored in different pages or different SSTables entirely:
>>>>>
>>>>> CREATE TABLE test.t1 (
>>>>>     id int PRIMARY KEY,
>>>>>     v1 int
>>>>> );
>>>>>
>>>>> CREATE MATERIALIZED VIEW test.test_mv AS
>>>>>     SELECT v1, id
>>>>>     FROM test.t1
>>>>>     WHERE id IS NOT NULL AND v1 IS NOT NULL
>>>>>     PRIMARY KEY (v1, id)
>>>>>  WITH CLUSTERING ORDER BY (id ASC);
>>>>>
>>>>>
>>>>> Let's say we have some test data:
>>>>>
>>>>> cqlsh:test> select id, v1 from t1;
>>>>>
>>>>>  id | v1
>>>>> ----+----
>>>>>  10 | 11
>>>>>   1 | 14
>>>>>  19 | 10
>>>>>   2 | 14
>>>>>   3 | 14
>>>>>
>>>>> When we transform the data by iterating over the base table, we get
>>>>> this representation (note v1=14):
>>>>>
>>>>> cqlsh:test> select v1, id from t1;
>>>>>
>>>>>  v1 | id
>>>>> ----+----
>>>>>  11 | 10
>>>>>  14 |  1   <------
>>>>>  10 | 19
>>>>>  14 |  2 <------
>>>>>  14 |  3  <------
>>>>>
>>>>>
>>>>> The partition key in the new table is v1.  If you simply iterate and
>>>>> transform and calculate merkle trees on the fly, you'll hit v1=14 with
>>>>> id=1, but you'll miss id=2 and id=3.  You need to get them all up front,
>>>>> and in sorted order, before you calculate the hash.  You actually need to
>>>>> transform the data to this, prior to calculating the tree:
>>>>>
>>>>> v1 | id
>>>>> ----+----
>>>>>  11 | 10
>>>>>  14 |  1, 2, 3
>>>>>  10 | 19
>>>>>
>>>>> Without an index you need to do one of the following over a dataset
>>>>> that's hundreds of GB:
>>>>>
>>>>> * for each partition, scan the entire range for all the data, then
>>>>> sort that partition in memory, then calculate the hash
>>>>> * collect the entire dataset in memory, transform and sort it
>>>>> * use a local index which has the keys already sorted
>>>>>
>>>>> A similar problem exists when trying to resolve the mismatches.
>>>>>
>>>>> Unless I'm missing some critical detail, I can't see how this will
>>>>> work without requiring nodes have hundreds of GB of RAM or we do several
>>>>> orders of magnitude more I/O than a normal repair.
>>>>>
>>>>> Jon
>>>>>
>>>>>
>>>>>
>>>>> On Thu, May 15, 2025 at 9:09 PM Runtian Liu <curly...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Thank you for the thoughtful questions, Jon. I really appreciate
>>>>> them—let me go through them one by one.
>>>>> ** *Do you intend on building all the Merkle trees in parallel?
>>>>>
>>>>> Since we take a snapshot to "freeze" the dataset, we don’t need to
>>>>> build all Merkle trees in parallel.
>>>>>
>>>>>
>>>>> * Will there be hundreds of files doing random IO to persist the trees
>>>>> to disk, in addition to the sequential IO from repair?
>>>>>
>>>>> The Merkle tree will only be persisted after the entire range scan is
>>>>> complete.
>>>>>
>>>>>
>>>>> * Is the intention of persisting the trees to disk to recover from
>>>>> failure, or just to limit memory usage?
>>>>>
>>>>> This is primarily to limit memory usage. As you may have noticed, MV
>>>>> repair needs to coordinate across the entire cluster rather than just a 
>>>>> few
>>>>> nodes. This process may take very long time and it may node may restart or
>>>>> do other operations during the time.
>>>>>
>>>>>
>>>>> ** *Have you calculated the Merkle tree space requirements?
>>>>> This is a very good question—I'll add it to the CEP as well. Each leaf
>>>>> node stores a 32-byte hash. With a tree depth of 15 (which is on the 
>>>>> higher
>>>>> end—smaller datasets might use fewer than 10 levels), a single Merkle tree
>>>>> would be approximately 32 × 2¹⁵ bytes, or 1 MB. If we split the tokens 
>>>>> into
>>>>> 10 ranges per node, we’ll end up with around 100 Merkle trees per node,
>>>>> totaling roughly 100 MB.
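>>>>> Spelling out that arithmetic (a sketch using the figures quoted above):
>>>>>
>>>>> public class TreeSizeEstimate
>>>>> {
>>>>>     public static void main(String[] args)
>>>>>     {
>>>>>         long hashBytes = 32;                 // bytes per leaf hash
>>>>>         long leaves    = 1L << 15;           // depth-15 tree => 2^15 leaves
>>>>>         long treeBytes = hashBytes * leaves; // ~1 MiB per tree
>>>>>         long ranges    = 10;                 // split ranges per node
>>>>>         long trees     = ranges * ranges;    // ~100 trees per node
>>>>>         System.out.println(treeBytes / (1024 * 1024) + " MiB per tree, "
>>>>>                            + (trees * treeBytes) / (1024 * 1024) + " MiB per node");
>>>>>         // prints: 1 MiB per tree, 100 MiB per node
>>>>>     }
>>>>> }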
>>>>> * When do we build the Merkle trees for the view?  Is that happening
>>>>> in parallel with the base table?  Do we have the computational complexity
>>>>> of 2 full cluster repairs running simultaneously, or does it take twice as
>>>>> long?
>>>>>
>>>>> As mentioned earlier, this can be done in parallel with the base table
>>>>> or after building the base table’s Merkle tree, since we’re using a
>>>>> snapshot to “freeze” the data.
>>>>>
>>>>> > I'm very curious to hear if anyone has run a full cluster repair
>>>>> recently on a non-trivial dataset.  Every cluster I work with only does
>>>>> subrange repair.  I can't even recall the last time I did a full repair on
>>>>> a large cluster.  I may never have, now that I think about it.  Every time
>>>>> I've done this in the past it's been plagued with issues, both in terms of
>>>>> performance and reliability.  Subrange repair works because it can make
>>>>> progress in 15-30 minute increments.
>>>>> When we run a full repair, we trigger subrange repair on one node,
>>>>> then proceed to the next subrange, and continue this way until the node's
>>>>> entire primary range is repaired. After that, we move on to the next
>>>>> node—correct? The complexity comparison between full repair and the
>>>>> proposed MV repair is meant to compare the cost of repairing the entire
>>>>> dataset, not just a subrange.
>>>>>
>>>>> For the example you mentioned, let me explain how it works using the
>>>>> schema—without needing to create an index to build the Merkle trees.
>>>>>
>>>>> Suppose we have a node that owns the token range 1–30, and we have a
>>>>> few records in the base table and its corresponding MV:
>>>>>
>>>>>    - Base table: (1, 1), (2, 11), (12, 1), (23, 1)
>>>>>    - MV: (1, 1), (1, 12), (1, 23), (2, 11)
>>>>>
>>>>> When we run a full repair, we divide the node’s range into subranges
>>>>> of size 10, we have r=3 ranges in total.
>>>>>
>>>>> First, we repair the range (1–10). The records (1, 1) and (2, 11) fall
>>>>> into this range and are used to build the first Merkle tree, which is then
>>>>> compared with the corresponding tree from another replica.
>>>>>
>>>>> Next, we repair the range (11–20). Here, the record (12, 1) is used to
>>>>> build the second Merkle tree.
>>>>>
>>>>> Finally, we repair the range (21–30), using the record (23, 1) to
>>>>> build the third Merkle tree, which is again compared with a replica's
>>>>> version.
>>>>>
>>>>> In MV repair, we still use a subrange size of 10. The key difference
>>>>> is that each Merkle tree is responsible for data not just based on the 
>>>>> base
>>>>> table's partition key, but also on the MV's partition key.
>>>>>
>>>>> For example, when scanning the base table over the range (1–10):
>>>>>
>>>>>    - In full repair, we generate one Merkle tree for that subrange.
>>>>>    - In MV repair, we generate *r* = 3 Merkle trees, one for each MV
>>>>>      partition key range.
>>>>>
>>>>>
>>>>> This means the record (1, 1) will go into the first tree because the
>>>>> MV partition key is 1, while (2, 11) will go into the second tree because
>>>>> its MV key is 11. The third tree will be empty because there is no record
>>>>> with base table key in (1-10) and MV key in (21-30).
>>>>>
>>>>> After scanning the base table range (1–10), we proceed to the next
>>>>> range, (11–20), and again generate 3 Merkle trees, followed by the last
>>>>> range. This is why the total number of Merkle trees is *r²*—in this
>>>>> case, 9 trees need to be built for the entire table.
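>>>>>
>>>>> A rough sketch of that bucketing, using the example above (hypothetical
>>>>> names; token range 1-30 split evenly into (1-10), (11-20), (21-30), so
>>>>> r = 3):
>>>>>
>>>>> public class TreeBucketSketch
>>>>> {
>>>>>     static final int R = 3;               // number of split ranges
>>>>>     static final int RANGE_SIZE = 10;     // tokens 1-30 split into 3 subranges
>>>>>
>>>>>     static int rangeIndex(int token)
>>>>>     {
>>>>>         return (token - 1) / RANGE_SIZE;  // 0, 1 or 2
>>>>>     }
>>>>>
>>>>>     // Each (base range, MV range) cell is an independent Merkle tree: r*r = 9 total.
>>>>>     static int treeFor(int baseKey, int viewKey)
>>>>>     {
>>>>>         return rangeIndex(baseKey) * R + rangeIndex(viewKey);
>>>>>     }
>>>>>
>>>>>     public static void main(String[] args)
>>>>>     {
>>>>>         System.out.println(treeFor(1, 1));    // base (1-10),  MV (1-10)  -> tree 0
>>>>>         System.out.println(treeFor(2, 11));   // base (1-10),  MV (11-20) -> tree 1
>>>>>         System.out.println(treeFor(12, 1));   // base (11-20), MV (1-10)  -> tree 3
>>>>>         System.out.println(treeFor(23, 1));   // base (21-30), MV (1-10)  -> tree 6
>>>>>     }
>>>>> }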
>>>>>
>>>>> A similar idea applies when scanning the MV to build Merkle trees.
>>>>> Essentially, for MV repair, each Merkle tree represents two-dimensional
>>>>> data, unlike normal repair where it only represents one dimension. Each
>>>>> Merkle tree represents the data that maps to Range(x) in the base table 
>>>>> and
>>>>> Range(y) in the MV.
>>>>>
>>>>>
>>>>> In full repair, tokens must be sorted when adding to the Merkle tree
>>>>> because the tree is built from the leaves—records are added sequentially
>>>>> from left to right.
>>>>> For MV repair, since the leaf nodes are sorted by the MV partition
>>>>> key, a base table row can be inserted into any leaf node. This means we
>>>>> must insert each hash starting from the root instead of directly at the
>>>>> leaf.
>>>>> As noted in the comparison table, this increases complexity:
>>>>>
>>>>>
>>>>>
>>>>>    - In full repair, Merkle tree building is *O(1)* per row—each hash
>>>>>      is added sequentially to the leaf nodes.
>>>>>    - In MV repair, each hash must be inserted from the root, making it
>>>>>      *O(d)* per row.
>>>>>
>>>>>
>>>>> Since *d* (the tree depth) is typically small—less than 20 and often
>>>>> smaller than in full repair—this added complexity isn’t a major concern in
>>>>> practice. The reason it is smaller than full repair is that, with the 
>>>>> above
>>>>> example, we use 3 trees to represent the same amount of data while full
>>>>> repair uses 1 tree.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Note that within each leaf node, the order in which hashes are added
>>>>> doesn’t matter. Cassandra repair currently enforces sorted input only to
>>>>> ensure that leaf nodes are built from left to right.
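>>>>>
>>>>> A minimal sketch of that root-down insertion (toy code, not the actual
>>>>> implementation): each row hash walks d levels down from the root, guided
>>>>> by its MV token, and is XORed into whichever leaf it lands in.
>>>>>
>>>>> public class RootInsertSketch
>>>>> {
>>>>>     final int depth;                // d
>>>>>     final long[] leaves;            // 2^d leaf hashes, keyed by MV token range
>>>>>     final long minToken, maxToken;
>>>>>
>>>>>     RootInsertSketch(int depth, long minToken, long maxToken)
>>>>>     {
>>>>>         this.depth = depth;
>>>>>         this.leaves = new long[1 << depth];
>>>>>         this.minToken = minToken;
>>>>>         this.maxToken = maxToken;
>>>>>     }
>>>>>
>>>>>     // O(d) per row: descend from the root by MV token, then XOR into the leaf.
>>>>>     void add(long viewToken, long rowHash)
>>>>>     {
>>>>>         long lo = minToken, hi = maxToken;
>>>>>         int idx = 0;
>>>>>         for (int level = 0; level < depth; level++)
>>>>>         {
>>>>>             long mid = lo + (hi - lo) / 2;
>>>>>             idx <<= 1;
>>>>>             if (viewToken > mid) { idx |= 1; lo = mid + 1; }
>>>>>             else                 { hi = mid; }
>>>>>         }
>>>>>         leaves[idx] ^= rowHash; // arrival order within a leaf doesn't matter
>>>>>     }
>>>>>     // Interior node hashes can be computed bottom-up once all rows are added.
>>>>> }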
>>>>>
>>>>> > So let's say we find a mismatch in a hash.  That indicates that
>>>>> there's some range of data where we have an issue.  For some token range
>>>>> calculated from the v1 field, we have a mismatch, right?  What do we do
>>>>> with that information?
>>>>> With the above example being said, when we identify a range mismatch,
>>>>> it means we’ve found that data within the base table primary key range
>>>>> (a–b) and MV primary key range (m–n) has inconsistencies. We only need to
>>>>> rebuild this specific data.
>>>>> This allows us to easily locate the base table node that owns range
>>>>> (a–b) and rebuild only the affected MV partition key range (m–n).
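>>>>>
>>>>> In other words (a sketch with hypothetical names), the rebuild predicate
>>>>> for a mismatching cell is just a pair of range checks:
>>>>>
>>>>> public class MismatchFilterSketch
>>>>> {
>>>>>     // A base row is replayed to the view only if its base token falls in
>>>>>     // (a, b] and the token of its MV partition key falls in (m, n].
>>>>>     static boolean needsRebuild(long baseToken, long viewToken,
>>>>>                                 long a, long b, long m, long n)
>>>>>     {
>>>>>         return baseToken > a && baseToken <= b
>>>>>             && viewToken > m && viewToken <= n;
>>>>>     }
>>>>> }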
>>>>>
>>>>> * Will there be coordination between all nodes in the cluster to
>>>>> ensure you don't have to do multiple scans?
>>>>>
>>>>>
>>>>> Yes, coordination is important for this type of repair. With the
>>>>> proposed solution, we can detect mismatches between the base table and the
>>>>> MV by scanning data from each of them just once.
>>>>> However, this doesn't mean all nodes need to be healthy during the
>>>>> repair. You can think of all the Merkle trees as forming a 2D matrix—if 
>>>>> one
>>>>> node is down, it corresponds to one row and one column being unavailable
>>>>> for comparison. The remaining cells can still be used for mismatch
>>>>> detection.
>>>>>
>>>>>
>>>>>
>>>>> Please don’t hesitate to let me know if anything is unclear or if you
>>>>> have any further questions or concerns—I’d be happy to discuss them.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Runtian
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, May 15, 2025 at 6:34 PM Jon Haddad <j...@rustyrazorblade.com>
>>>>> wrote:
>>>>>
>>>>> One last thing.  I'm pretty sure building the tree requires the keys
>>>>> be added in token order:
>>>>> https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L173
>>>>>
>>>>> Which definitely introduces a bit of a problem, given that the tree
>>>>> would be constructed from the transformed v1, which is a value
>>>>> unpredictable enough to be considered random.
>>>>>
>>>>> The only way I can think of to address this would be to maintain a
>>>>> local index on v1.  See my previous email where I mentioned this.
>>>>>
>>>>> Base Table -> Local Index -> Global Index
>>>>>
>>>>> Still a really hard problem.
>>>>>
>>>>> Jon
>>>>>
>>>>>
>>>>>
>>>>> On Thu, May 15, 2025 at 6:12 PM Jon Haddad <j...@rustyrazorblade.com>
>>>>> wrote:
>>>>>
>>>>> There's a lot here that's still confusing to me.  Maybe you can help
>>>>> me understand it better?  Apologies in advance for the text wall :)
>>>>>
>>>>> I'll use this schema as an example:
>>>>>
>>>>> ---------
>>>>> CREATE TABLE test.t1 (
>>>>>     id int PRIMARY KEY,
>>>>>     v1 int
>>>>> );
>>>>>
>>>>> create MATERIALIZED VIEW  test_mv as
>>>>> SELECT v1, id from test.t1 where id is not null and v1 is not null
>>>>> primary key (v1, id);
>>>>> ---------
>>>>>
>>>>> We've got (id, v1) in the base table and (v1, id) in the MV.
>>>>>
>>>>> During the repair, we snapshot, and construct a whole bunch of merkle
>>>>> trees.  CEP-48 says they will be persisted to disk.
>>>>>
>>>>> ** *Do you intend on building all the Merkle trees in parallel?
>>>>> * Will there be hundreds of files doing random IO to persist the trees
>>>>> to disk, in addition to the sequential IO from repair?
>>>>> * Is the intention of persisting the trees to disk to recover from
>>>>> failure, or just to limit memory usage?
>>>>> ** *Have you calculated the Merkle tree space requirements?
>>>>> * When do we build the Merkle trees for the view?  Is that happening
>>>>> in parallel with the base table?  Do we have the computational complexity
>>>>> of 2 full cluster repairs running simultaneously, or does it take twice as
>>>>> long?
>>>>>
>>>>> I'm very curious to hear if anyone has run a full cluster repair
>>>>> recently on a non-trivial dataset.  Every cluster I work with only does
>>>>> subrange repair.  I can't even recall the last time I did a full repair on
>>>>> a large cluster.  I may never have, now that I think about it.  Every time
>>>>> I've done this in the past it's been plagued with issues, both in terms of
>>>>> performance and reliability.  Subrange repair works because it can make
>>>>> progress in 15-30 minute increments.
>>>>>
>>>>> Anyways - moving on...
>>>>>
>>>>> You suggest we read the base table and construct the Merkle trees
>>>>> based on the transformed rows. Using my schema above, we take the v1 field
>>>>> and use token(v1), to build the tree.  Assuming that a value for v1 
>>>>> appears
>>>>> many times throughout the dataset across many partitions, how do you 
>>>>> intend
>>>>> on calculating its hash?  If you look at Validator.rowHash [1] and
>>>>> Validator.add, you'll see it's taking an UnfilteredRowIterator for an
>>>>> entire partition and calculates the hash based on that.  Here's the 
>>>>> comment:
>>>>>
>>>>>  /**
>>>>>      * Called (in order) for every row present in the CF.
>>>>>      * Hashes the row, and adds it to the tree being built.
>>>>>      *
>>>>>      * @param partition Partition to add hash
>>>>>      */
>>>>>     public void add(UnfilteredRowIterator partition)
>>>>>
>>>>> So it seems to me like you need to have the entire partition
>>>>> materialized in memory before adding to the tree.    Doing that per value
>>>>> v1 without an index is pretty much impossible - we'd have to scan the
>>>>> entire dataset once per partition to pull out all the matching v1 values,
>>>>> or you'd need to materialize the entire dataset into a local version of 
>>>>> the
>>>>> MV for that range. I don't know how you could do this.  Do you have a
>>>>> workaround for this planned?  Maybe someone that knows the Merkle tree 
>>>>> code
>>>>> better can chime in.
>>>>>
>>>>> Maybe there's something else here I'm not aware of - please let me
>>>>> know what I'm missing here if I am, it would be great to see this in the
>>>>> doc if you have a solution.
>>>>>
>>>>> For the sake of discussion, let's assume we've moved past this and we
>>>>> have our tree for a hundreds of ranges built from the base table & built
>>>>> for the MV, now we move onto the comparison.
>>>>>
>>>>> In the doc at this point, we delete the snapshot because we have the
>>>>> tree structures and we compare Merkle trees.  Then we stream mismatched
>>>>> data.
>>>>>
>>>>> So let's say we find a mismatch in a hash.  That indicates that
>>>>> there's some range of data where we have an issue.  For some token range
>>>>> calculated from the v1 field, we have a mismatch, right?  What do we do
>>>>> with that information?
>>>>>
>>>>> * Do we tell the node that owned the base table - hey, stream the data
>>>>> from base where token(v1) is in range [X,Y) to me?
>>>>> * That means we have to scan through the base again for all rows where
>>>>> token(v1) in [X,Y) range, right?  Because without an index on the hashes 
>>>>> of
>>>>> v1, we're doing a full table scan and hashing every v1 value to find out 
>>>>> if
>>>>> it needs to be streamed back to the MV.
>>>>> * Are we doing this concurrently on all nodes?
>>>>> * Will there be coordination between all nodes in the cluster to
>>>>> ensure you don't have to do multiple scans?
>>>>>
>>>>> I realized there's a lot of questions here, but unfortunately I'm
>>>>> having a hard time seeing how we can workaround some of the core
>>>>> assumptions around constructing Merkle trees and using them to resolve the
>>>>> differences in a way that matches up with what's in the doc.  I have quite
>>>>> a few more things to discuss, but I'll save them for a follow up once all
>>>>> these have been sorted out.
>>>>>
>>>>> Thanks in advance!
>>>>> Jon
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L209
>>>>>
>>>>>
>>>>>
>>>>> On Thu, May 15, 2025 at 10:10 AM Runtian Liu <curly...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> The previous table compared the complexity of full repair and MV
>>>>> repair when reconciling one dataset with another. In production, we
>>>>> typically use a replication factor of 3 in one datacenter. This means full
>>>>> repair involves 3n rows, while MV repair involves comparing 6n rows (base 
>>>>> +
>>>>> MV). Below is an updated comparison table reflecting this scenario.
>>>>>
>>>>> n: number of rows to repair (Total rows in the table)
>>>>>
>>>>> d: depth of one Merkle tree for MV repair
>>>>>
>>>>> r: number of split ranges
>>>>>
>>>>> p: data compacted away
>>>>>
>>>>>
>>>>> This comparison focuses on the complexities of one round of full
>>>>> repair with a replication factor of 3 versus repairing a single MV based 
>>>>> on
>>>>> one base table with replication factor 3.
>>>>>
>>>>> Extra disk used: Full Repair 0; MV Repair O(2*p).
>>>>> Since we take a snapshot at the beginning of the repair, any disk space
>>>>> that would normally be freed by compaction will remain occupied until
>>>>> the Merkle trees are successfully built and the snapshot is cleared.
>>>>>
>>>>> Data scan complexity: Full Repair O(3*n); MV Repair O(6*n).
>>>>> Full repair scans n rows from the primary and 2n from replicas. MV
>>>>> repair scans 3n rows from the base table and 3n from the MV.
>>>>>
>>>>> Merkle tree building time complexity: Full Repair O(3n); MV Repair
>>>>> O(6*n*d).
>>>>> In full repair, Merkle tree building is O(1) per row—each hash is added
>>>>> sequentially to the leaf nodes. In MV repair, each hash is inserted from
>>>>> the root, making it O(d) per row. Since d is typically small (less than
>>>>> 20 and often smaller than in full repair), this isn’t a major concern.
>>>>>
>>>>> Total Merkle tree count: Full Repair O(3*r); MV Repair O(6*r^2).
>>>>> MV repair needs to generate more, smaller Merkle trees, but this isn’t
>>>>> a concern as they can be persisted to disk during the repair process.
>>>>>
>>>>> Merkle tree comparison complexity: Full Repair O(3n); MV Repair O(3n).
>>>>> Assuming one row maps to one leaf node, both repairs are equivalent.
>>>>>
>>>>> Stream time complexity
>>>>>
>>>>>
