I spoke too soon - endless questions are not over :)

Since the data that's going to be repaired only covers a range, I wonder if
it makes sense to have the ability to issue a minimalist snapshot that only
hardlinks SSTables that are in a token range.  Based on what you (Runtian)
have said above, only a small percentage of the data would actually be
repaired at any given time.

Just a thought to save a little filesystem churn.


On Fri, May 16, 2025 at 10:55 AM Jon Haddad <j...@rustyrazorblade.com> wrote:

> Never mind about the height thing, I guess it's the same property.
>
> I’m done for now :)
>
> Thanks for entertaining my endless questions. My biggest concerns about
> repair have been alleviated.
>
> Jon
>
> On Fri, May 16, 2025 at 10:34 AM Jon Haddad <j...@rustyrazorblade.com>
> wrote:
>
>> That's the critical bit I was missing, thank you Blake.
>>
>> I guess we’d need to have unlimited height trees then, since you’d need
>> to be able to update the hashes of individual partitions, and we’d also
>> need to propagate the hashes up every time as well. I’m curious what the
>> cost will look like with that.
>>
>> At least it’s a CPU problem, not an I/O one.
>>
>> Jon
>>
>>
>> On Fri, May 16, 2025 at 10:04 AM Blake Eggleston <bl...@ultrablake.com>
>> wrote:
>>
>>> The Merkle tree XORs the individual row hashes together, which is
>>> commutative. So you should be able to build a tree in view token order
>>> while reading in base table token order, and vice versa.
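>>>
>>> To illustrate (a minimal sketch with hypothetical names, not the actual
>>> Validator/MerkleTree code): because the per-leaf combination is an XOR,
>>> rows can be added in any order and the leaf hash comes out the same.
>>>
>>> import java.nio.charset.StandardCharsets;
>>> import java.security.MessageDigest;
>>>
>>> final class XorLeafSketch {
>>>     private final byte[] combined = new byte[32];
>>>
>>>     // hash one (key, value) row and fold it into the leaf with XOR
>>>     void add(String rowKey, String rowValue) throws Exception {
>>>         MessageDigest md = MessageDigest.getInstance("SHA-256");
>>>         md.update(rowKey.getBytes(StandardCharsets.UTF_8));
>>>         md.update(rowValue.getBytes(StandardCharsets.UTF_8));
>>>         byte[] h = md.digest();
>>>         for (int i = 0; i < combined.length; i++)
>>>             combined[i] ^= h[i];   // XOR is commutative and associative
>>>     }
>>>
>>>     byte[] leafHash() { return combined.clone(); }
>>> }
>>>
>>> Feeding the same rows in base-table order or in view order produces an
>>> identical leafHash(), which is what makes the out-of-order build possible.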
>>>
>>> On Fri, May 16, 2025, at 9:54 AM, Jon Haddad wrote:
>>>
>>> Thanks for the explanation, I appreciate it.  I think you might still be
>>> glossing over an important point - which I'll call out on its own here.
>>> There are a number of things I'm concerned about, but this is a big one.
>>>
>>> Calculating the hash of a partition for a Merkle tree needs to be done
>>> on the fully materialized, sorted partition.
>>>
>>> The examples you're giving are simple, to the point where they hide the
>>> problem.  Here's a better example, where the MV has a clustering column. A
>>> partition in the MV will have multiple rows, but in the base table those
>>> rows will be stored in different pages or different SSTables entirely:
>>>
>>> CREATE TABLE test.t1 (
>>>     id int PRIMARY KEY,
>>>     v1 int
>>> );
>>>
>>> CREATE MATERIALIZED VIEW test.test_mv AS
>>>     SELECT v1, id
>>>     FROM test.t1
>>>     WHERE id IS NOT NULL AND v1 IS NOT NULL
>>>     PRIMARY KEY (v1, id)
>>>     WITH CLUSTERING ORDER BY (id ASC);
>>>
>>>
>>> Let's say we have some test data:
>>>
>>> cqlsh:test> select id, v1 from t1;
>>>
>>>  id | v1
>>> ----+----
>>>  10 | 11
>>>   1 | 14
>>>  19 | 10
>>>   2 | 14
>>>   3 | 14
>>>
>>> When we transform the data by iterating over the base table, we get this
>>> representation (note v1=14):
>>>
>>> cqlsh:test> select v1, id from t1;
>>>
>>>  v1 | id
>>> ----+----
>>>  11 | 10
>>>  14 |  1   <------
>>>  10 | 19
>>>  14 |  2 <------
>>>  14 |  3  <------
>>>
>>>
>>> The partition key in the new table is v1.  If you simply iterate,
>>> transform, and calculate Merkle trees on the fly, you'll hit v1=14 with
>>> id=1, but you'll miss id=2 and id=3.  You need to get them all up front,
>>> and in sorted order, before you calculate the hash.  You actually need to
>>> transform the data to this, prior to calculating the tree:
>>>
>>> v1 | id
>>> ----+----
>>>  11 | 10
>>>  14 |  1, 2, 3
>>>  10 | 19
>>>
>>> Without an index you need to do one of the following over a dataset
>>> that's hundreds of GB:
>>>
>>> * for each partition, scan the entire range for all the data, then sort
>>> that partition in memory, then calculate the hash
>>> * collect the entire dataset in memory, transform and sort it
>>> * use a local index which has the keys already sorted
>>>
>>> A similar problem exists when trying to resolve the mismatches.
>>>
>>> Unless I'm missing some critical detail, I can't see how this will work
>>> without requiring nodes to have hundreds of GB of RAM or doing several
>>> orders of magnitude more I/O than a normal repair.
>>>
>>> Jon
>>>
>>>
>>>
>>> On Thu, May 15, 2025 at 9:09 PM Runtian Liu <curly...@gmail.com> wrote:
>>>
>>> Thank you for the thoughtful questions, Jon. I really appreciate
>>> them—let me go through them one by one.
>>> * Do you intend on building all the Merkle trees in parallel?
>>>
>>> Since we take a snapshot to "freeze" the dataset, we don’t need to build
>>> all Merkle trees in parallel.
>>>
>>>
>>> * Will there be hundreds of files doing random IO to persist the trees
>>> to disk, in addition to the sequential IO from repair?
>>>
>>> The Merkle tree will only be persisted after the entire range scan is
>>> complete.
>>>
>>>
>>> * Is the intention of persisting the trees to disk to recover from
>>> failure, or just to limit memory usage?
>>>
>>> This is primarily to limit memory usage. As you may have noticed, MV
>>> repair needs to coordinate across the entire cluster rather than just a few
>>> nodes. This process may take a very long time, and a node may restart or
>>> perform other operations during that time.
>>>
>>>
>>> * Have you calculated the Merkle tree space requirements?
>>> This is a very good question—I'll add it to the CEP as well. Each leaf
>>> node stores a 32-byte hash. With a tree depth of 15 (which is on the higher
>>> end—smaller datasets might use fewer than 10 levels), a single Merkle tree
>>> would be approximately 32 × 2¹⁵ bytes, or 1 MB. If we split the tokens into
>>> 10 ranges per node, we’ll end up with around 100 Merkle trees per node,
>>> totaling roughly 100 MB.
>>> * When do we build the Merkle trees for the view?  Is that happening in
>>> parallel with the base table?  Do we have the computational complexity of 2
>>> full cluster repairs running simultaneously, or does it take twice as long?
>>>
>>> As mentioned earlier, this can be done in parallel with the base table
>>> or after building the base table’s Merkle tree, since we’re using a
>>> snapshot to “freeze” the data.
>>>
>>> > I'm very curious to hear if anyone has run a full cluster repair
>>> recently on a non-trivial dataset.  Every cluster I work with only does
>>> subrange repair.  I can't even recall the last time I did a full repair on
>>> a large cluster.  I may never have, now that I think about it.  Every time
>>> I've done this in the past it's been plagued with issues, both in terms of
>>> performance and reliability.  Subrange repair works because it can make
>>> progress in 15-30 minute increments.
>>> When we run a full repair, we trigger subrange repair on one node, then
>>> proceed to the next subrange, and continue this way until the node's entire
>>> primary range is repaired. After that, we move on to the next node—correct?
>>> The complexity comparison between full repair and the proposed MV repair is
>>> meant to compare the cost of repairing the entire dataset, not just a
>>> subrange.
>>>
>>> For the example you mentioned, let me explain how it works using the
>>> schema—without needing to create an index to build the Merkle trees.
>>>
>>> Suppose we have a node that owns the token range 1–30, and we have a few
>>> records in the base table and its corresponding MV (base rows shown as
>>> (base PK, MV PK), MV rows as (MV PK, base PK)):
>>>
>>>    - Base table: (1, 1), (2, 11), (12, 1), (23, 1)
>>>    - MV: (1, 1), (1, 12), (1, 23), (11, 2)
>>>
>>> When we run a full repair, we divide the node’s range into subranges of
>>> size 10, so we have r = 3 ranges in total.
>>>
>>> First, we repair the range (1–10). The records (1, 1) and (2, 11) fall
>>> into this range and are used to build the first Merkle tree, which is then
>>> compared with the corresponding tree from another replica.
>>>
>>> Next, we repair the range (11–20). Here, the record (12, 1) is used to
>>> build the second Merkle tree.
>>>
>>> Finally, we repair the range (21–30), using the record (23, 1) to build
>>> the third Merkle tree, which is again compared with a replica's version.
>>>
>>> In MV repair, we still use a subrange size of 10. The key difference is
>>> that each Merkle tree is responsible for data not just based on the base
>>> table's partition key, but also on the MV's partition key.
>>>
>>> For example, when scanning the base table over the range (1–10):
>>>
>>>    - In full repair, we generate one Merkle tree for that subrange.
>>>    - In MV repair, we generate *r* = 3 Merkle trees, one for each MV
>>>      partition key range.
>>>
>>>
>>> This means the record (1, 1) will go into the first tree because the MV
>>> partition key is 1, while (2, 11) will go into the second tree because its
>>> MV key is 11. The third tree will be empty because there is no record with
>>> a base table key in (1–10) and an MV key in (21–30).
>>>
>>> After scanning the base table range (1–10), we proceed to the next
>>> range, (11–20), and again generate 3 Merkle trees, followed by the last
>>> range. This is why the total number of Merkle trees is *r²*—in this
>>> case, 9 trees need to be built for the entire table.
>>>
>>> A similar idea applies when scanning the MV to build Merkle trees.
>>> Essentially, for MV repair, each Merkle tree represents two-dimensional
>>> data, unlike normal repair where it only represents one dimension. Each
>>> Merkle tree represents the data that maps to Range(x) in the base table and
>>> Range(y) in the MV.
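>>>
>>> As a rough sketch of that two-dimensional bucketing (hypothetical names
>>> and an evenly split token space, not the CEP's actual classes), each base
>>> row scanned from subrange i is routed to one of r trees by the token of
>>> its MV partition key:
>>>
>>> final class MvTreeGridSketch {
>>>     final int r;                  // number of split ranges
>>>     final byte[][][] cellHash;    // running XOR per (base range, MV range) cell
>>>
>>>     MvTreeGridSketch(int r) {
>>>         this.r = r;
>>>         this.cellHash = new byte[r][r][32];
>>>     }
>>>
>>>     // assume tokens in [0, tokenSpace) with tokenSpace >= r, split evenly into r ranges
>>>     int bucket(long token, long tokenSpace) {
>>>         return (int) Math.min(r - 1, token / (tokenSpace / r));
>>>     }
>>>
>>>     // called while scanning base subrange `baseRange` in base token order
>>>     void addBaseRow(int baseRange, long mvToken, long tokenSpace, byte[] rowHash32) {
>>>         int mvRange = bucket(mvToken, tokenSpace);
>>>         byte[] cell = cellHash[baseRange][mvRange];
>>>         for (int k = 0; k < cell.length; k++)
>>>             cell[k] ^= rowHash32[k];   // commutative, so base-order scanning is fine
>>>     }
>>> }
>>>
>>> Each cell is collapsed to a single hash here for brevity; in the proposal
>>> each cell is a full Merkle tree, which is what allows a mismatch to be
>>> narrowed down within a (base range, MV range) cell.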
>>>
>>>
>>> In full repair, tokens must be sorted when adding to the Merkle tree
>>> because the tree is built from the leaves—records are added sequentially
>>> from left to right.
>>> For MV repair, since the leaf nodes are sorted by the MV partition key,
>>> a base table row can be inserted into any leaf node. This means we must
>>> insert each hash starting from the root instead of directly at the leaf.
>>> As noted in the comparison table, this increases complexity:
>>>
>>>
>>>
>>>    - In full repair, Merkle tree building is *O(1)* per row—each hash is
>>>      added sequentially to the leaf nodes.
>>>    - In MV repair, each hash must be inserted from the root, making it
>>>      *O(d)* per row.
>>>
>>>
>>> Since *d* (the tree depth) is typically small—less than 20 and often
>>> smaller than in full repair—this added complexity isn’t a major concern in
>>> practice. The reason it is smaller than in full repair is that, in the
>>> above example, we use 3 trees to represent the same amount of data while
>>> full repair uses 1 tree.
>>>
>>>
>>>
>>>
>>> Note that within each leaf node, the order in which hashes are added
>>> doesn’t matter. Cassandra repair currently enforces sorted input only to
>>> ensure that leaf nodes are built from left to right.
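>>>
>>> A minimal sketch of that per-row insertion (hypothetical code, not the
>>> actual MerkleTree implementation): the owning leaf is found by descending
>>> from the root over the MV token range, so each row costs O(d), and the XOR
>>> inside the leaf keeps the result independent of arrival order.
>>>
>>> final class MvMerkleTreeSketch {
>>>     final int depth;         // d levels below the root
>>>     final long lo, hi;       // MV token range covered by this tree
>>>     final byte[][] leaves;   // 2^d leaves, each an XOR of its row hashes
>>>
>>>     MvMerkleTreeSketch(int depth, long lo, long hi) {
>>>         this.depth = depth;
>>>         this.lo = lo;
>>>         this.hi = hi;
>>>         this.leaves = new byte[1 << depth][32];
>>>     }
>>>
>>>     void add(long mvToken, byte[] rowHash32) {
>>>         long lower = lo, upper = hi;
>>>         int leaf = 0;
>>>         for (int level = 0; level < depth; level++) {   // root-to-leaf descent: O(d)
>>>             long mid = lower + (upper - lower) / 2;
>>>             if (mvToken < mid) { upper = mid; leaf = leaf << 1; }
>>>             else               { lower = mid; leaf = (leaf << 1) | 1; }
>>>         }
>>>         for (int k = 0; k < 32; k++)
>>>             leaves[leaf][k] ^= rowHash32[k];            // order within a leaf doesn't matter
>>>     }
>>>     // inner node hashes would be computed bottom-up once the scan completes
>>> }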
>>>
>>> > So let's say we find a mismatch in a hash.  That indicates that
>>> there's some range of data where we have an issue.  For some token range
>>> calculated from the v1 field, we have a mismatch, right?  What do we do
>>> with that information?
>>> With the above example in mind, when we identify a range mismatch, it
>>> means we’ve found that data within the base table primary key range (a–b)
>>> and the MV primary key range (m–n) has inconsistencies. We only need to
>>> rebuild this specific data.
>>> This allows us to easily locate the base table node that owns range
>>> (a–b) and rebuild only the affected MV partition key range (m–n).
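>>>
>>> A small sketch of what acting on a mismatched cell could look like
>>> (hypothetical types and names; the CEP doesn't prescribe these):
>>>
>>> record TokenRange(long left, long right) {}
>>>
>>> record MismatchedCell(TokenRange baseRange, TokenRange mvRange) {}
>>>
>>> interface ViewRebuilder {
>>>     // re-derive view rows from base rows whose base PK token falls in baseRange
>>>     // and whose MV PK token falls in mvRange, then stream them to the view replicas
>>>     void rebuildSlice(TokenRange baseRange, TokenRange mvRange);
>>> }
>>>
>>> final class MismatchHandler {
>>>     // only the mismatched (base range, MV range) cells are rebuilt, not the whole MV
>>>     static void repairMismatches(Iterable<MismatchedCell> mismatches, ViewRebuilder rebuilder) {
>>>         for (MismatchedCell cell : mismatches)
>>>             rebuilder.rebuildSlice(cell.baseRange(), cell.mvRange());
>>>     }
>>> }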
>>>
>>> * Will there be coordination between all nodes in the cluster to ensure
>>> you don't have to do multiple scans?
>>>
>>>
>>> Yes, coordination is important for this type of repair. With the
>>> proposed solution, we can detect mismatches between the base table and the
>>> MV by scanning data from each of them just once.
>>> However, this doesn't mean all nodes need to be healthy during the
>>> repair. You can think of all the Merkle trees as forming a 2D matrix—if one
>>> node is down, it corresponds to one row and one column being unavailable
>>> for comparison. The remaining cells can still be used for mismatch
>>> detection.
>>>
>>>
>>>
>>> Please don’t hesitate to let me know if anything is unclear or if you
>>> have any further questions or concerns—I’d be happy to discuss them.
>>>
>>>
>>> Thanks,
>>>
>>> Runtian
>>>
>>>
>>>
>>>
>>> On Thu, May 15, 2025 at 6:34 PM Jon Haddad <j...@rustyrazorblade.com>
>>> wrote:
>>>
>>> One last thing.  I'm pretty sure building the tree requires the keys be
>>> added in token order:
>>> https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L173
>>>
>>> Which definitely introduces a bit of a problem, given that the tree
>>> would be constructed from the transformed v1, which is a value
>>> unpredictable enough to be considered random.
>>>
>>> The only way I can think of to address this would be to maintain a local
>>> index on v1.  See my previous email where I mentioned this.
>>>
>>> Base Table -> Local Index -> Global Index
>>>
>>> Still a really hard problem.
>>>
>>> Jon
>>>
>>>
>>>
>>> On Thu, May 15, 2025 at 6:12 PM Jon Haddad <j...@rustyrazorblade.com>
>>> wrote:
>>>
>>> There's a lot here that's still confusing to me.  Maybe you can help me
>>> understand it better?  Apologies in advance for the text wall :)
>>>
>>> I'll use this schema as an example:
>>>
>>> ---------
>>> CREATE TABLE test.t1 (
>>>     id int PRIMARY KEY,
>>>     v1 int
>>> );
>>>
>>> create MATERIALIZED VIEW  test_mv as
>>> SELECT v1, id from test.t1 where id is not null and v1 is not null
>>> primary key (v1, id);
>>> ---------
>>>
>>> We've got (id, v1) in the base table and (v1, id) in the MV.
>>>
>>> During the repair, we snapshot, and construct a whole bunch of merkle
>>> trees.  CEP-48 says they will be persisted to disk.
>>>
>>> * Do you intend on building all the Merkle trees in parallel?
>>> * Will there be hundreds of files doing random IO to persist the trees
>>> to disk, in addition to the sequential IO from repair?
>>> * Is the intention of persisting the trees to disk to recover from
>>> failure, or just to limit memory usage?
>>> * Have you calculated the Merkle tree space requirements?
>>> * When do we build the Merkle trees for the view?  Is that happening in
>>> parallel with the base table?  Do we have the computational complexity of 2
>>> full cluster repairs running simultaneously, or does it take twice as long?
>>>
>>> I'm very curious to hear if anyone has run a full cluster repair
>>> recently on a non-trivial dataset.  Every cluster I work with only does
>>> subrange repair.  I can't even recall the last time I did a full repair on
>>> a large cluster.  I may never have, now that I think about it.  Every time
>>> I've done this in the past it's been plagued with issues, both in terms of
>>> performance and reliability.  Subrange repair works because it can make
>>> progress in 15-30 minute increments.
>>>
>>> Anyways - moving on...
>>>
>>> You suggest we read the base table and construct the Merkle trees based
>>> on the transformed rows. Using my schema above, we take the v1 field and
>>> use token(v1) to build the tree.  Assuming that a value for v1 appears
>>> many times throughout the dataset across many partitions, how do you intend
>>> on calculating its hash?  If you look at Validator.rowHash [1] and
>>> Validator.add, you'll see it's taking an UnfilteredRowIterator for an
>>> entire partition and calculates the hash based on that.  Here's the comment:
>>>
>>>  /**
>>>      * Called (in order) for every row present in the CF.
>>>      * Hashes the row, and adds it to the tree being built.
>>>      *
>>>      * @param partition Partition to add hash
>>>      */
>>>     public void add(UnfilteredRowIterator partition)
>>>
>>> So it seems to me like you need to have the entire partition
>>> materialized in memory before adding to the tree.  Doing that per value of
>>> v1 without an index is pretty much impossible - we'd have to scan the
>>> entire dataset once per partition to pull out all the matching v1 values,
>>> or you'd need to materialize the entire dataset into a local version of the
>>> MV for that range. I don't know how you could do this.  Do you have a
>>> workaround for this planned?  Maybe someone who knows the Merkle tree code
>>> better can chime in.
>>>
>>> Maybe there's something else here I'm not aware of - please let me know
>>> what I'm missing here if I am, it would be great to see this in the doc if
>>> you have a solution.
>>>
>>> For the sake of discussion, let's assume we've moved past this and we
>>> have our tree for a hundreds of ranges built from the base table & built
>>> for the MV, now we move onto the comparison.
>>>
>>> In the doc at this point, we delete the snapshot because we have the
>>> tree structures and we compare Merkle trees.  Then we stream mismatched
>>> data.
>>>
>>> So let's say we find a mismatch in a hash.  That indicates that there's
>>> some range of data where we have an issue.  For some token range calculated
>>> from the v1 field, we have a mismatch, right?  What do we do with that
>>> information?
>>>
>>> * Do we tell the node that owned the base table - hey, stream the data
>>> from base where token(v1) is in range [X,Y) to me?
>>> * That means we have to scan through the base again for all rows where
>>> token(v1) in [X,Y) range, right?  Because without an index on the hashes of
>>> v1, we're doing a full table scan and hashing every v1 value to find out if
>>> it needs to be streamed back to the MV.
>>> * Are we doing this concurrently on all nodes?
>>> * Will there be coordination between all nodes in the cluster to ensure
>>> you don't have to do multiple scans?
>>>
>>> I realize there are a lot of questions here, but unfortunately I'm having
>>> a hard time seeing how we can work around some of the core assumptions
>>> around constructing Merkle trees and using them to resolve the differences
>>> in a way that matches up with what's in the doc.  I have quite a few more
>>> things to discuss, but I'll save them for a follow-up once all these have
>>> been sorted out.
>>>
>>> Thanks in advance!
>>> Jon
>>>
>>> [1]
>>> https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L209
>>>
>>>
>>>
>>> On Thu, May 15, 2025 at 10:10 AM Runtian Liu <curly...@gmail.com> wrote:
>>>
>>> The previous table compared the complexity of full repair and MV repair
>>> when reconciling one dataset with another. In production, we typically use
>>> a replication factor of 3 in one datacenter. This means full repair
>>> involves 3n rows, while MV repair involves comparing 6n rows (base + MV).
>>> Below is an updated comparison table reflecting this scenario.
>>>
>>> n: number of rows to repair (Total rows in the table)
>>>
>>> d: depth of one Merkle tree for MV repair
>>>
>>> r: number of split ranges
>>>
>>> p: data compacted away
>>>
>>>
>>> This comparison focuses on the complexities of one round of full repair
>>> with a replication factor of 3 versus repairing a single MV based on one
>>> base table with replication factor 3.
>>>
>>> Metric | Full Repair | MV Repair
>>>
>>> Extra disk used: 0 | O(2*p)
>>> Since we take a snapshot at the beginning of the repair, any disk space
>>> that would normally be freed by compaction will remain occupied until the
>>> Merkle trees are successfully built and the snapshot is cleared.
>>>
>>> Data scan complexity: O(3n) | O(6n)
>>> Full repair scans n rows from the primary and 2n from replicas.
>>> MV repair scans 3n rows from the base table and 3n from the MV.
>>>
>>> Merkle tree building time complexity: O(3n) | O(6n*d)
>>> In full repair, Merkle tree building is O(1) per row—each hash is added
>>> sequentially to the leaf nodes.
>>> In MV repair, each hash is inserted from the root, making it O(d) per
>>> row. Since d is typically small (less than 20 and often smaller than in
>>> full repair), this isn’t a major concern.
>>>
>>> Total Merkle tree count: O(3r) | O(6r^2)
>>> MV repair needs to generate more, smaller Merkle trees, but this isn’t a
>>> concern as they can be persisted to disk during the repair process.
>>>
>>> Merkle tree comparison complexity: O(3n) | O(3n)
>>> Assuming one row maps to one leaf node, both repairs are equivalent.
>>>
>>> Stream time complexity: O(3n) | O(3n)
>>> Assuming all rows need to be streamed, both repairs are equivalent.
>>>
>>> *In short:* Even for production use cases with RF=3 in one data center,
>>> MV repair consumes temporary disk space and a small, usually negligible
>>> amount of extra CPU for tree construction; other costs match full repair.
>>>
>>> Additionally, with the online path proposed in this CEP, we expect
>>> mismatches to be rare, which can lower the frequency of running this repair
>>> process compared to full repair.
>>>
>>>
>>> On Thu, May 15, 2025 at 9:53 AM Jon Haddad <j...@rustyrazorblade.com>
>>> wrote:
>>>
>>> > They are not two unordered sets, but rather two sets ordered by
>>> different keys.
>>>
>>> I think this is a distinction without a difference. Merkle tree repair
>>> works because the ordering of the data is mostly the same across nodes.
>>>
>>>
>>> On Thu, May 15, 2025 at 9:27 AM Runtian Liu <curly...@gmail.com> wrote:
>>>
>>> > what we're trying to achieve here is comparing two massive unordered
>>> sets.
>>>
>>> They are not two unordered sets, but rather two sets ordered by
>>> different keys. This means that when building Merkle trees for the base
>>> table and the materialized view (MV), we need to use different strategies
>>> to ensure the trees can be meaningfully compared.
>>>
>>> To address scalability concerns for MV repair, I’ve included a
>>> comparison between one round of full repair and MV repair in the table
>>> below. This comparison is also added to the CEP.
>>>
>>> n: number of rows to repair (Total rows in the table)
>>>
>>> d: depth of one Merkle tree for MV repair
>>>
>>> r: number of split ranges
>>>
>>> p: data compacted away
>>>
>>>
>>> This comparison focuses on the complexities of one round of full repair
>>> with a replication factor of 2 versus repairing a single MV based on one
>>> base table replica.
>>>
>>> Metric | Full Repair | MV Repair
>>>
>>> Extra disk used: 0 | O(2*p)
>>> Since we take a snapshot at the beginning of the repair, any disk space
>>> that would normally be freed by compaction will remain occupied until the
>>> Merkle trees are successfully built and the snapshot is cleared.
>>>
>>> Data scan complexity: O(2n) | O(2n)
>>> Full repair scans n rows from the primary and n from replicas.
>>> MV repair scans n rows from the base table primary replica only, and
>>> n from the MV primary replica only.
>>>
>>> Merkle tree building time complexity: O(n) | O(n*d)
>>> In full repair, Merkle tree building is O(1) per row—each hash is added
>>> sequentially to the leaf nodes.
>>> In MV repair, each hash is inserted from the root, making it O(d) per
>>> row. Since d is typically small (less than 20 and often smaller than in
>>> full repair), this isn’t a major concern.
>>>
>>> Total Merkle tree count: O(2r) | O(2r^2)
>>> MV repair needs to generate more, smaller Merkle trees, but this isn’t a
>>> concern as they can be persisted to disk during the repair process.
>>>
>>> Merkle tree comparison complexity: O(n) | O(n)
>>> Assuming one row maps to one leaf node, both repairs are equivalent.
>>>
>>> Stream time complexity: O(n) | O(n)
>>> Assuming all rows need to be streamed, both repairs are equivalent.
>>>
>>> *In short:* MV repair consumes temporary disk space and a small,
>>> usually negligible amount of extra CPU for tree construction; other costs
>>> match full repair.
>>>
>>> The core idea behind the proposed MV repair is as follows:
>>>
>>>    1. Take a snapshot to “freeze” the current state of both the base table
>>>       and its MV.
>>>    2. Gradually scan the data from both tables to build Merkle trees.
>>>    3. Identify the token ranges where inconsistencies exist.
>>>    4. Rebuild only the mismatched ranges rather than the entire MV.
>>>
>>> With transaction-backed MVs, step 4 should rarely be necessary.
>>>
>>> On Thu, May 15, 2025 at 7:54 AM Josh McKenzie <jmcken...@apache.org>
>>> wrote:
>>>
>>>
>>> I think in order to address this, the view update should be propagated to
>>> the view replicas *after* the base write is accepted by all or a majority
>>> of base replicas. This is where I think mutation tracking could probably help.
>>>
>>> Yeah, the idea of "don't reflect in the MV until you hit the CL the user
>>> requested for the base table". It introduces disjoint risk if you have
>>> coordinator death mid-write where replicas got base data but that 2nd step
>>> didn't take place; I think that's why Runtian et al. are looking at Paxos
>>> repair picking up those pieces for you after the fact to get you back into
>>> consistency. Mutation tracking and Accord both have similar guarantees in
>>> this space.
>>>
>>> I think this would ensure that as long as there's no data loss or
>>> bit-rot, the base and view can be repaired independently. When there is
>>> data loss or bit-rot in either the base table or the view, then it is the
>>> same as 2i today: rebuild is required.
>>>
>>> And the repair as proposed in the CEP should resolve the bit-rot and
>>> bug-induced data loss cases, I think. It certainly has much higher time
>>> complexity, but bounding the memory complexity to be comparable with
>>> regular repair doesn't strike me as a dealbreaker.
>>>
>>> On Thu, May 15, 2025, at 10:24 AM, Paulo Motta wrote:
>>>
>>> >  I think requiring a rebuild is a deal breaker for most teams.  In most
>>> instances it would also mean having to expand the cluster to handle the
>>> additional disk requirements.  It turns an inconsistency problem into a
>>> major operational headache that can take weeks to resolve.
>>>
>>> Agreed. The rebuild would not be required during normal operations when
>>> the cluster is properly maintained (i.e. regular repair) - only in
>>> catastrophic situations.  This is also the case for ordinary tables
>>> currently: if there's data loss, then restoring from a backup is needed.
>>> Restoring from a backup could be a possible alternative to requiring a
>>> rebuild in this extraordinary scenario.
>>>
>>> On Thu, May 15, 2025 at 10:14 AM Jon Haddad <j...@rustyrazorblade.com>
>>> wrote:
>>>
>>> I think requiring a rebuild is a deal breaker for most teams.  In most
>>> instances it would also mean having to expand the cluster to handle the
>>> additional disk requirements.  It turns an inconsistency problem into a
>>> major operational headache that can take weeks to resolve.
>>>
>>>
>>>
>>>
>>>
>>> On Thu, May 15, 2025 at 7:02 AM Paulo Motta <pauloricard...@gmail.com>
>>> wrote:
>>>
>>> > There's bi-directional entropy issues with MV's - either orphaned view
>>> data or missing view data; that's why you kind of need a "bi-directional
>>> ETL" to make sure the 2 agree with each other. While normal repair would
>>> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV
>>> that's not in base table anymore" case, which afaict all base consistency
>>> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable
>>> to.
>>>
>>> I don't think that bi-directional reconciliation should be a
>>> requirement, when the base table is assumed to be the source of truth as
>>> stated in the CEP doc.
>>>
>>> I think the main issue with the current MV implementation is that each
>>> view replica is independently replicated by the base replica, before the
>>> base write is acknowledged.
>>>
>>> This creates a correctness issue in the write path, because a view
>>> update can be created for a write that was not accepted by the coordinator
>>> in the following scenario:
>>>
>>> N=RF=3
>>> CL=ONE
>>> - Update U is propagated to view replica V; the coordinator, which is also
>>> base replica B, dies before applying the base table write or acknowledging
>>> it to the client. Now U exists in V but not in B.
>>>
>>> I think in order to address this, the view update should be propagated to
>>> the view replicas *after* the base write is accepted by all or a majority
>>> of base replicas. This is where I think mutation tracking could probably help.
>>>
>>> I think this would ensure that as long as there's no data loss or
>>> bit-rot, the base and view can be repaired independently. When there is
>>> data loss or bit-rot in either the base table or the view, then it is the
>>> same as 2i today: rebuild is required.
>>>
>>> >  It'd be correct (if operationally disappointing) to be able to just
>>> say "if you have data loss in your base table you need to rebuild the
>>> corresponding MV's", but the problem is operators aren't always going to
>>> know when that data loss occurs. Not everything is as visible as a lost
>>> quorum of replicas or blown up SSTables.
>>>
>>> I think there are opportunities to improve rebuild speed, assuming the
>>> base table is the source of truth. For example, rebuilding only the
>>> subranges where data loss is detected.
>>>
>>> On Thu, May 15, 2025 at 8:07 AM Josh McKenzie <jmcken...@apache.org>
>>> wrote:
>>>
>>>
>>> There's bi-directional entropy issues with MV's - either orphaned view
>>> data or missing view data; that's why you kind of need a "bi-directional
>>> ETL" to make sure the 2 agree with each other. While normal repair would
>>> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV
>>> that's not in base table anymore" case, which afaict all base consistency
>>> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable
>>> to.
>>>
>>> It'd be correct (if operationally disappointing) to be able to just say
>>> "if you have data loss in your base table you need to rebuild the
>>> corresponding MV's", but the problem is operators aren't always going to
>>> know when that data loss occurs. Not everything is as visible as a lost
>>> quorum of replicas or blown up SSTables.
>>>
>>> On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote:
>>>
>>> Maybe, I’m not really familiar enough with how “classic” MV repair works
>>> to say. You can’t mix normal repair and mutation reconciliation in the
>>> current incarnation of mutation tracking though, so I wouldn’t assume it
>>> would work with MVs.
>>>
>>> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote:
>>>
>>> In the case of bitrot / losing an SSTable, wouldn't a normal repair
>>> (just the MV against the other nodes) resolve the issue?
>>>
>>> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston <bl...@ultrablake.com>
>>> wrote:
>>>
>>>
>>> Mutation tracking is definitely an approach you could take for MVs.
>>> Mutation reconciliation could be extended to ensure all changes have been
>>> replicated to the views. When a base table receives a mutation w/ an id,
>>> it would generate a view update. If you block marking a given mutation id as
>>> reconciled until it’s been fully replicated to the base table and its view
>>> updates have been fully replicated to the views, then all view updates will
>>> eventually be applied as part of the log reconciliation process.
>>>
>>> A mutation tracking implementation would also allow you to be more
>>> flexible with the types of consistency levels you can work with, allowing
>>> users to do things like use LOCAL_QUORUM without leaving themselves open to
>>> introducing view inconsistencies.
>>>
>>> That would more or less eliminate the need for any MV repair in normal
>>> usage, but wouldn't address how to repair issues caused by bugs or data
>>> loss, though you may be able to do something with comparing the latest
>>> mutation ids for a base table and its view ranges.
>>>
>>> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote:
>>>
>>> I don't see mutation tracking [1] mentioned in this thread or in the
>>> CEP-48 description. Not sure this would fit into the scope of this
>>> initial CEP, but I have a feeling that mutation tracking could be
>>> potentially helpful to reconcile base tables and views?
>>>
>>> For example, when both base and view updates are acknowledged then this
>>> could be somehow persisted in the view sstables mutation tracking
>>> summary[2] or similar metadata ? Then these updates would be skipped during
>>> view repair, considerably reducing the amount of work needed, since only
>>> un-acknowledged views updates would need to be reconciled.
>>>
>>> [1] - https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking
>>> [2] - https://issues.apache.org/jira/browse/CASSANDRA-20336
>>>
>>>
