I spoke too soon - the endless questions are not over :) Since the data being repaired only covers a subrange, I wonder if it makes sense to be able to issue a minimalist snapshot that only hardlinks the SSTables that intersect the token range being repaired. Based on what you (Runtian) have said above, only a small percentage of the data would actually be repaired at any given time.
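Roughly what I have in mind, as a sketch only: the SSTableFiles holder and the token-bounds check below are stand-ins for whatever metadata the real snapshot path would use, wrap-around ranges are ignored, and none of this is the actual Cassandra snapshot code.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class RangeScopedSnapshot
{
    // Minimal stand-in for the per-sstable metadata we'd need: token bounds plus component files.
    static final class SSTableFiles
    {
        final long minToken, maxToken;
        final List<Path> components; // Data.db, Index.db, Statistics.db, ...

        SSTableFiles(long minToken, long maxToken, List<Path> components)
        {
            this.minToken = minToken;
            this.maxToken = maxToken;
            this.components = components;
        }
    }

    // Hardlink only the sstables whose token bounds intersect the subrange being repaired.
    static void snapshotIntersecting(List<SSTableFiles> sstables, long left, long right, Path snapshotDir) throws IOException
    {
        Files.createDirectories(snapshotDir);
        for (SSTableFiles sstable : sstables)
        {
            if (sstable.maxToken < left || sstable.minToken > right)
                continue; // cannot contain data for this subrange, so skip the links entirely
            for (Path component : sstable.components)
                Files.createLink(snapshotDir.resolve(component.getFileName()), component);
        }
    }
}

It probably only pays off when sstables don't all span the node's full range, but for the small subranges being repaired here it could skip most of the links.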
Just a thought to save a little filesystem churn. On Fri, May 16, 2025 at 10:55 AM Jon Haddad <j...@rustyrazorblade.com> wrote: > Nevermind about the height thing i guess its the same property. > > I’m done for now :) > > Thanks for entertaining my endless questions. My biggest concerns about > repair have been alleviated. > > Jon > > On Fri, May 16, 2025 at 10:34 AM Jon Haddad <j...@rustyrazorblade.com> > wrote: > >> Thats the critical bit i was missing, thank you Blake. >> >> I guess we’d need to have unlimited height trees then, since you’d need >> to be able to update the hashes of individual partitions, and we’d also >> need to propagate the hashes up every time as well. I’m curious what the >> cost will look like with that. >> >> At least it’s a cpu problem not an I/O one. >> >> Jon >> >> >> On Fri, May 16, 2025 at 10:04 AM Blake Eggleston <bl...@ultrablake.com> >> wrote: >> >>> The merkle tree xor's the individual row hashes together, which is >>> commutative. So you should be able to build a tree in the view token order >>> while reading in base table token order and vise versa. >>> >>> On Fri, May 16, 2025, at 9:54 AM, Jon Haddad wrote: >>> >>> Thanks for the explanation, I appreciate it. I think you might still be >>> glossing over an important point - which I'll make singularly here. >>> There's a number of things I'm concerned about, but this is a big one. >>> >>> Calculating the hash of a partition for a Merkle tree needs to be done >>> on the fully materialized, sorted partition. >>> >>> The examples you're giving are simple, to the point where they hide the >>> problem. Here's a better example, where the MV has a clustering column. In >>> the MV's partition it'll have multiple rows, but in the base table it'll be >>> stored in different pages or different SSTables entirely: >>> >>> CREATE TABLE test.t1 ( >>> id int PRIMARY KEY, >>> v1 int >>> ); >>> >>> CREATE MATERIALIZED VIEW test.test_mv AS >>> SELECT v1, id >>> FROM test.t1 >>> WHERE id IS NOT NULL AND v1 IS NOT NULL >>> PRIMARY KEY (v1, id) >>> WITH CLUSTERING ORDER BY (id ASC); >>> >>> >>> Let's say we have some test data: >>> >>> cqlsh:test> select id, v1 from t1; >>> >>> id | v1 >>> ----+---- >>> 10 | 11 >>> 1 | 14 >>> 19 | 10 >>> 2 | 14 >>> 3 | 14 >>> >>> When we transform the data by iterating over the base table, we get this >>> representation (note v1=14): >>> >>> cqlsh:test> select v1, id from t1; >>> >>> v1 | id >>> ----+---- >>> 11 | 10 >>> 14 | 1 <------ >>> 10 | 19 >>> 14 | 2 <------ >>> 14 | 3 <------ >>> >>> >>> The partiton key in the new table is v1. If you simply iterate and >>> transform and calculate merkle trees on the fly, you'll hit v1=14 with >>> id=1, but you'll miss id=2 and id=3. You need to get them all up front, >>> and in sorted order, before you calculate the hash. You actually need to >>> transform the data to this, prior to calculating the tree: >>> >>> v1 | id >>> ----+---- >>> 11 | 10 >>> 14 | 1, 2, 3 >>> 10 | 19 >>> >>> Without an index you need to do one of the following over a dataset >>> that's hundreds of GB: >>> >>> * for each partition, scan the entire range for all the data, then sort >>> that partition in memory, then calculate the hash >>> * collect the entire dataset in memory, transform and sort it >>> * use a local index which has the keys already sorted >>> >>> A similar problem exists when trying to resolve the mismatches. 
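>>> In code, the second option above (collect and sort in memory) amounts to something like the following. This is illustrative Java only, not the Validator API, with a toy Row type standing in for the real rows:
>>>
>>> import java.util.HashMap;
>>> import java.util.Map;
>>> import java.util.TreeMap;
>>>
>>> final class RegroupByViewKey
>>> {
>>>     static final class Row { final int id, v1; Row(int id, int v1) { this.id = id; this.v1 = v1; } }
>>>
>>>     // Rows arrive in base-table token order (id = 10, 1, 19, 2, 3), but the view
>>>     // partition for v1=14 needs ids 1, 2 and 3 together, and sorted, before it can be hashed.
>>>     static Map<Integer, TreeMap<Integer, Row>> groupByViewKey(Iterable<Row> baseOrderScan)
>>>     {
>>>         Map<Integer, TreeMap<Integer, Row>> byV1 = new HashMap<>();
>>>         for (Row row : baseOrderScan)
>>>             byV1.computeIfAbsent(row.v1, k -> new TreeMap<>()).put(row.id, row);
>>>         return byV1; // the whole range ends up materialized here before any hashing starts
>>>     }
>>> }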
>>> >>> Unless I'm missing some critical detail, I can't see how this will work >>> without requiring nodes have hundreds of GB of RAM or we do several orders >>> of magnitude more I/O than a normal repair. >>> >>> Jon >>> >>> >>> >>> On Thu, May 15, 2025 at 9:09 PM Runtian Liu <curly...@gmail.com> wrote: >>> >>> Thank you for the thoughtful questions, Jon. I really appreciate >>> them—let me go through them one by one. >>> ** *Do you intend on building all the Merkle trees in parallel? >>> >>> Since we take a snapshot to "freeze" the dataset, we don’t need to build >>> all Merkle trees in parallel. >>> >>> >>> * Will there be hundreds of files doing random IO to persist the trees >>> to disk, in addition to the sequential IO from repair? >>> >>> The Merkle tree will only be persisted after the entire range scan is >>> complete. >>> >>> >>> * Is the intention of persisting the trees to disk to recover from >>> failure, or just to limit memory usage? >>> >>> This is primarily to limit memory usage. As you may have noticed, MV >>> repair needs to coordinate across the entire cluster rather than just a few >>> nodes. This process may take very long time and it may node may restart or >>> do other operations during the time. >>> >>> >>> ** *Have you calculated the Merkle tree space requirements? >>> This is a very good question—I'll add it to the CEP as well. Each leaf >>> node stores a 32-byte hash. With a tree depth of 15 (which is on the higher >>> end—smaller datasets might use fewer than 10 levels), a single Merkle tree >>> would be approximately 32 × 2¹⁵ bytes, or 1 MB. If we split the tokens into >>> 10 ranges per node, we’ll end up with around 100 Merkle trees per node, >>> totaling roughly 100 MB. >>> * When do we build the Merkle trees for the view? Is that happening in >>> parallel with the base table? Do we have the computational complexity of 2 >>> full cluster repairs running simultaneously, or does it take twice as long? >>> >>> As mentioned earlier, this can be done in parallel with the base table >>> or after building the base table’s Merkle tree, since we’re using a >>> snapshot to “freeze” the data. >>> >>> > I'm very curious to hear if anyone has run a full cluster repair >>> recently on a non-trivial dataset. Every cluster I work with only does >>> subrange repair. I can't even recall the last time I did a full repair on >>> a large cluster. I may never have, now that I think about it. Every time >>> I've done this in the past it's been plagued with issues, both in terms of >>> performance and reliability. Subrange repair works because it can make >>> progress in 15-30 minute increments. >>> When we run a full repair, we trigger subrange repair on one node, then >>> proceed to the next subrange, and continue this way until the node's entire >>> primary range is repaired. After that, we move on to the next node—correct? >>> The complexity comparison between full repair and the proposed MV repair is >>> meant to compare the cost of repairing the entire dataset, not just a >>> subrange. >>> >>> For the example you mentioned, let me explain how it works using the >>> schema—without needing to create an index to build the Merkle trees. 
>>> >>> Suppose we have a node that owns the token range 1–30, and we have a few >>> records in the base table and its corresponding MV: >>> >>> - >>> >>> Base table: (1, 1), (2, 11), (12, 1), (23, 1) >>> - >>> >>> MV: (1, 1), (1, 12), (1, 23), (2, 11) >>> >>> When we run a full repair, we divide the node’s range into subranges of >>> size 10, we have r=3 ranges in total. >>> >>> First, we repair the range (1–10). The records (1, 1) and (2, 11) fall >>> into this range and are used to build the first Merkle tree, which is then >>> compared with the corresponding tree from another replica. >>> >>> Next, we repair the range (11–20). Here, the record (12, 1) is used to >>> build the second Merkle tree. >>> >>> Finally, we repair the range (21–30), using the record (23, 1) to build >>> the third Merkle tree, which is again compared with a replica's version. >>> >>> In MV repair, we still use a subrange size of 10. The key difference is >>> that each Merkle tree is responsible for data not just based on the base >>> table's partition key, but also on the MV's partition key. >>> >>> For example, when scanning the base table over the range (1–10): >>> >>> - >>> >>> In full repair, we generate one Merkle tree for that subrange. >>> - >>> >>> In MV repair, we generate *r* = 3 Merkle trees, one for each MV >>> partition key range. >>> >>> >>> This means the record (1, 1) will go into the first tree because the MV >>> partition key is 1, while (2, 11) will go into the second tree because its >>> MV key is 11. The third tree will be empty because there is no record with >>> base table key in (1-10) and MV key in (20-30). >>> >>> After scanning the base table range (1–10), we proceed to the next >>> range, (11–20), and again generate 3 Merkle trees, followed by the last >>> range. This is why the total number of Merkle trees is *r²*—in this >>> case, 9 trees need to be built for the entire table. >>> >>> A similar idea applies when scanning the MV to build Merkle trees. >>> Essentially, for MV repair, each Merkle tree represents two-dimensional >>> data, unlike normal repair where it only represents one dimension. Each >>> Merkle tree represents the data that maps to Range(x) in the base table and >>> Range(y) in the MV. >>> >>> >>> In full repair, tokens must be sorted when adding to the Merkle tree >>> because the tree is built from the leaves—records are added sequentially >>> from left to right. >>> For MV repair, since the leaf nodes are sorted by the MV partition key, >>> a base table row can be inserted into any leaf node. This means we must >>> insert each hash starting from the root instead of directly at the leaf. >>> As noted in the comparison table, this increases complexity: >>> >>> >>> >>> - >>> >>> In full repair, Merkle tree building is *O(1)* per row—each hash is >>> added sequentially to the leaf nodes. >>> - >>> >>> In MV repair, each hash must be inserted from the root, making it >>> *O(d)* per row. >>> >>> >>> Since *d* (the tree depth) is typically small—less than 20 and often >>> smaller than in full repair—this added complexity isn’t a major concern in >>> practice. The reason it is smaller than full repair is that, with the above >>> example, we use 3 trees to represent the same amount of data while full >>> repair uses 1 tree. >>> >>> >>> >>> >>> Note that within each leaf node, the order in which hashes are added >>> doesn’t matter. Cassandra repair currently enforces sorted input only to >>> ensure that leaf nodes are built from left to right. 
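>>> Very roughly, the per-row insertion looks like the sketch below. It assumes a fixed-depth binary tree over the MV token range and 64-bit row hashes for brevity; it is illustrative only, not the existing MerkleTree class:
>>>
>>> final class MvMerkleTree
>>> {
>>>     private final int depth;      // d
>>>     private final long[] leaves;  // 2^d leaves covering the MV partition key (token) range
>>>     private final long minToken, maxToken;
>>>
>>>     MvMerkleTree(int depth, long minToken, long maxToken)
>>>     {
>>>         this.depth = depth;
>>>         this.leaves = new long[1 << depth];
>>>         this.minToken = minToken;
>>>         this.maxToken = maxToken;
>>>     }
>>>
>>>     // O(d) per row: walk down from the root to the leaf that owns this row's MV token.
>>>     void add(long mvToken, long rowHash)
>>>     {
>>>         long lo = minToken, hi = maxToken;
>>>         int leaf = 0;
>>>         for (int level = 0; level < depth; level++)
>>>         {
>>>             long mid = lo + (hi - lo) / 2;
>>>             leaf <<= 1;
>>>             if (mvToken > mid) { leaf |= 1; lo = mid + 1; }
>>>             else               { hi = mid; }
>>>         }
>>>         // XOR is commutative, so the base-table scan order does not matter within a leaf.
>>>         leaves[leaf] ^= rowHash;
>>>     }
>>> }
>>>
>>> (In this sketch the interior hashes would be recomputed from the leaves after the scan, so the scan itself only pays the O(d) routing per row.)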
>>> >>> > So let's say we find a mismatch in a hash. That indicates that >>> there's some range of data where we have an issue. For some token range >>> calculated from the v1 field, we have a mismatch, right? What do we do >>> with that information? >>> With the above example being said, when we identify a range mismatch, it >>> means we’ve found that data within the base table primary key range (a–b) >>> and MV primary key range (m–n) has inconsistencies. We only need to rebuild >>> this specific data. >>> This allows us to easily locate the base table node that owns range >>> (a–b) and rebuild only the affected MV partition key range (m–n). >>> >>> * Will there be coordination between all nodes in the cluster to ensure >>> you don't have to do multiple scans? >>> >>> >>> Yes, coordination is important for this type of repair. With the >>> proposed solution, we can detect mismatches between the base table and the >>> MV by scanning data from each of them just once. >>> However, this doesn't mean all nodes need to be healthy during the >>> repair. You can think of all the Merkle trees as forming a 2D matrix—if one >>> node is down, it corresponds to one row and one column being unavailable >>> for comparison. The remaining cells can still be used for mismatch >>> detection. >>> >>> >>> >>> Please don’t hesitate to let me know if anything is unclear or if you >>> have any further questions or concerns—I’d be happy to discuss them. >>> >>> >>> Thanks, >>> >>> Runtian >>> >>> >>> >>> >>> On Thu, May 15, 2025 at 6:34 PM Jon Haddad <j...@rustyrazorblade.com> >>> wrote: >>> >>> One last thing. I'm pretty sure building the tree requires the keys be >>> added in token order: >>> https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L173 >>> >>> Which definitely introduces a bit of a problem, given that the tree >>> would be constructed from the transformed v1, which is a value >>> unpredictable enough to be considered random. >>> >>> The only way I can think of to address this would be to maintain a local >>> index on v1. See my previous email where I mentioned this. >>> >>> Base Table -> Local Index -> Global Index >>> >>> Still a really hard problem. >>> >>> Jon >>> >>> >>> >>> On Thu, May 15, 2025 at 6:12 PM Jon Haddad <j...@rustyrazorblade.com> >>> wrote: >>> >>> There's a lot here that's still confusing to me. Maybe you can help me >>> understand it better? Apologies in advance for the text wall :) >>> >>> I'll use this schema as an example: >>> >>> --------- >>> CREATE TABLE test.t1 ( >>> id int PRIMARY KEY, >>> v1 int >>> ); >>> >>> create MATERIALIZED VIEW test_mv as >>> SELECT v1, id from test.t1 where id is not null and v1 is not null >>> primary key (v1, id); >>> --------- >>> >>> We've got (id, v1) in the base table and (v1, id) in the MV. >>> >>> During the repair, we snapshot, and construct a whole bunch of merkle >>> trees. CEP-48 says they will be persisted to disk. >>> >>> ** *Do you intend on building all the Merkle trees in parallel? >>> * Will there be hundreds of files doing random IO to persist the trees >>> to disk, in addition to the sequential IO from repair? >>> * Is the intention of persisting the trees to disk to recover from >>> failure, or just to limit memory usage? >>> ** *Have you calculated the Merkle tree space requirements? >>> * When do we build the Merkle trees for the view? Is that happening in >>> parallel with the base table? 
Do we have the computational complexity of 2 >>> full cluster repairs running simultaneously, or does it take twice as long? >>> >>> I'm very curious to hear if anyone has run a full cluster repair >>> recently on a non-trivial dataset. Every cluster I work with only does >>> subrange repair. I can't even recall the last time I did a full repair on >>> a large cluster. I may never have, now that I think about it. Every time >>> I've done this in the past it's been plagued with issues, both in terms of >>> performance and reliability. Subrange repair works because it can make >>> progress in 15-30 minute increments. >>> >>> Anyways - moving on... >>> >>> You suggest we read the base table and construct the Merkle trees based >>> on the transformed rows. Using my schema above, we take the v1 field and >>> use token(v1), to build the tree. Assuming that a value for v1 appears >>> many times throughout the dataset across many partitions, how do you intend >>> on calculating it's hash? If you look at Validator.rowHash [1] and >>> Validator.add, you'll see it's taking an UnfilteredRowIterator for an >>> entire partition and calculates the hash based on that. Here's the comment: >>> >>> /** >>> * Called (in order) for every row present in the CF. >>> * Hashes the row, and adds it to the tree being built. >>> * >>> * @param partition Partition to add hash >>> */ >>> public void add(UnfilteredRowIterator partition) >>> >>> So it seems to me like you need to have the entire partition >>> materialized in memory before adding to the tree. Doing that per value >>> v1 without an index is pretty much impossible - we'd have to scan the >>> entire dataset once per partition to pull out all the matching v1 values, >>> or you'd need to materialize the entire dataset into a local version of the >>> MV for that range. I don't know how you could do this. Do you have a >>> workaround for this planned? Maybe someone that knows the Merkle tree code >>> better can chime in. >>> >>> Maybe there's something else here I'm not aware of - please let me know >>> what I'm missing here if I am, it would be great to see this in the doc if >>> you have a solution. >>> >>> For the sake of discussion, let's assume we've moved past this and we >>> have our tree for a hundreds of ranges built from the base table & built >>> for the MV, now we move onto the comparison. >>> >>> In the doc at this point, we delete the snapshot because we have the >>> tree structures and we compare Merkle trees. Then we stream mismatched >>> data. >>> >>> So let's say we find a mismatch in a hash. That indicates that there's >>> some range of data where we have an issue. For some token range calculated >>> from the v1 field, we have a mismatch, right? What do we do with that >>> information? >>> >>> * Do we tell the node that owned the base table - hey, stream the data >>> from base where token(v1) is in range [X,Y) to me? >>> * That means we have to scan through the base again for all rows where >>> token(v1) in [X,Y) range, right? Because without an index on the hashes of >>> v1, we're doing a full table scan and hashing every v1 value to find out if >>> it needs to be streamed back to the MV. >>> * Are we doing this concurrently on all nodes? >>> * Will there be coordination between all nodes in the cluster to ensure >>> you don't have to do multiple scans? 
>>> >>> I realized there's a lot of questions here, but unfortunately I'm having >>> a hard time seeing how we can workaround some of the core assumptions >>> around constructing Merkle trees and using them to resolve the differences >>> in a way that matches up with what's in the doc. I have quite a few more >>> things to discuss, but I'll save them for a follow up once all these have >>> been sorted out. >>> >>> Thanks in advance! >>> Jon >>> >>> [1] >>> https://github.com/apache/cassandra/blob/08946652434edbce38a6395e71d4068898ea13fa/src/java/org/apache/cassandra/repair/Validator.java#L209 >>> >>> >>> >>> On Thu, May 15, 2025 at 10:10 AM Runtian Liu <curly...@gmail.com> wrote: >>> >>> The previous table compared the complexity of full repair and MV repair >>> when reconciling one dataset with another. In production, we typically use >>> a replication factor of 3 in one datacenter. This means full repair >>> involves 3n rows, while MV repair involves comparing 6n rows (base + MV). >>> Below is an updated comparison table reflecting this scenario. >>> >>> n: number of rows to repair (Total rows in the table) >>> >>> d: depth of one Merkle tree for MV repair >>> >>> r: number of split ranges >>> >>> p: data compacted away >>> >>> >>> This comparison focuses on the complexities of one round of full repair >>> with a replication factor of 3 versus repairing a single MV based on one >>> base table with replication factor 3. >>> >>> *Full Repair* >>> >>> *MV Repair* >>> >>> *Comment* >>> >>> Extra disk used >>> >>> 0 >>> >>> O(2*p) >>> >>> Since we take a snapshot at the beginning of the repair, any disk space >>> that would normally be freed by compaction will remain occupied until the >>> Merkle trees are successfully built and the snapshot is cleared. >>> >>> Data scan complexity >>> >>> O(3*n) >>> >>> O(6*n) >>> >>> Full repair scans *n* rows from the primary and 2*n* from replicas.3 >>> >>> MV repair scans 3n rows from the base table and 3n from the MV. >>> >>> Merkle Tree building time complexity >>> >>> O(3n) >>> >>> O(6*n*d) >>> >>> In full repair, Merkle tree building is *O(1)* per row—each hash is >>> added sequentially to the leaf nodes. >>> >>> In MV repair, each hash is inserted from the root, making it *O(d)* per >>> row. Since *d* is typically small (less than 20 and often smaller than >>> in full repair), this isn’t a major concern. >>> >>> Total Merkle tree count >>> >>> O(3*r) >>> >>> O(6*r^2) >>> >>> MV repair needs to generate more, smaller Merkle trees, but this isn’t a >>> concern as they can be persisted to disk during the repair process. >>> >>> Merkle tree comparison complexity >>> >>> O(3n) >>> >>> O(3n) >>> >>> Assuming one row maps to one leaf node, both repairs are equivalent. >>> >>> Stream time complexity >>> >>> O(3n) >>> >>> O(3n) >>> >>> Assuming all rows need to be streamed, both repairs are equivalent. >>> >>> *In short:* Even for production use cases having RF=3 in one data >>> center, we can see that the MV repair consumes temporary disk space and a >>> small, usually negligible amount of extra CPU for tree construction; other >>> costs match full repair. >>> >>> Additionally, with the online path proposed in this CEP, we expect >>> mismatches to be rare, which can lower the frequency of running this repair >>> process compared to full repair. >>> >>> >>> On Thu, May 15, 2025 at 9:53 AM Jon Haddad <j...@rustyrazorblade.com> >>> wrote: >>> >>> > They are not two unordered sets, but rather two sets ordered by >>> different keys. 
>>> >>> I think this is a distinction without a difference. Merkle tree repair >>> works because the ordering of the data is mostly the same across nodes. >>> >>> >>> On Thu, May 15, 2025 at 9:27 AM Runtian Liu <curly...@gmail.com> wrote: >>> >>> > what we're trying to achieve here is comparing two massive unordered >>> sets. >>> >>> They are not two unordered sets, but rather two sets ordered by >>> different keys. This means that when building Merkle trees for the base >>> table and the materialized view (MV), we need to use different strategies >>> to ensure the trees can be meaningfully compared. >>> >>> To address scalability concerns for MV repair, I’ve included a >>> comparison between one round of full repair and MV repair in the table >>> below. This comparison is also added to the CEP. >>> >>> n: number of rows to repair (Total rows in the table) >>> >>> d: depth of one Merkle tree for MV repair >>> >>> r: number of split ranges >>> >>> p: data compacted away >>> >>> >>> This comparison focuses on the complexities of one round of full repair >>> with a replication factor of 2 versus repairing a single MV based on one >>> base table replica. >>> >>> *Full Repair* >>> >>> *MV Repair* >>> >>> *Comment* >>> >>> Extra disk used >>> >>> 0 >>> >>> O(2*p) >>> >>> Since we take a snapshot at the beginning of the repair, any disk space >>> that would normally be freed by compaction will remain occupied until the >>> Merkle trees are successfully built and the snapshot is cleared. >>> >>> Data scan complexity >>> >>> O(2*n) >>> >>> O(2*n) >>> >>> Full repair scans *n* rows from the primary and *n* from replicas. >>> >>> MV repair scans *n* rows from the base table primary replica only, and >>> *n* from the MV primary replica only. >>> >>> Merkle Tree building time complexity >>> >>> O(n) >>> >>> O(n*d) >>> >>> In full repair, Merkle tree building is *O(1)* per row—each hash is >>> added sequentially to the leaf nodes. >>> >>> In MV repair, each hash is inserted from the root, making it *O(d)* per >>> row. Since *d* is typically small (less than 20 and often smaller than >>> in full repair), this isn’t a major concern. >>> >>> Total Merkle tree count >>> >>> O(2*r) >>> >>> O(2*r^2) >>> >>> MV repair needs to generate more, smaller Merkle trees, but this isn’t a >>> concern as they can be persisted to disk during the repair process. >>> >>> Merkle tree comparison complexity >>> >>> O(n) >>> >>> O(n) >>> >>> Assuming one row maps to one leaf node, both repairs are equivalent. >>> >>> Stream time complexity >>> >>> O(n) >>> >>> O(n) >>> >>> Assuming all rows need to be streamed, both repairs are equivalent. >>> >>> *In short:* MV repair consumes temporary disk space and a small, >>> usually negligible amount of extra CPU for tree construction; other costs >>> match full repair. >>> >>> The core idea behind the proposed MV repair is as follows: >>> >>> 1. >>> >>> Take a snapshot to “freeze” the current state of both the base table >>> and its MV. >>> 2. >>> >>> Gradually scan the data from both tables to build Merkle trees. >>> 3. >>> >>> Identify the token ranges where inconsistencies exist. >>> 4. >>> >>> Rebuild only the mismatched ranges rather than the entire MV. >>> >>> With transaction-backed MVs, step 4 should rarely be necessary. >>> >>> On Thu, May 15, 2025 at 7:54 AM Josh McKenzie <jmcken...@apache.org> >>> wrote: >>> >>> >>> I think in order to address this, the view should be propagated to the >>> base replicas *after* it's accepted by all or a majority of base replicas. 
>>> This is where I think mutation tracking could probably help. >>> >>> Yeah, the idea of "don't reflect in the MV until you hit the CL the user >>> requested for the base table". Introduces disjoint risk if you have >>> coordinator death mid-write where replicas got base-data but that 2nd step >>> didn't take place; think that's why Runtien et. al are looking at paxos >>> repair picking up those pieces for you after the fact to get you back into >>> consistency. Mutation tracking and Accord both have similar guarantees in >>> this space. >>> >>> I think this would ensure that as long as there's no data loss or >>> bit-rot, the base and view can be repaired independently. When there is >>> data loss or bit-rot in either the base table or the view, then it is the >>> same as 2i today: rebuild is required. >>> >>> And the repair as proposed in the CEP should resolve the bitrot and bug >>> dataloss case I think. Certainly has much higher time complexity but the >>> bounding of memory complexity to be comparable with regular repair doesn't >>> strike me as a dealbreaker. >>> >>> On Thu, May 15, 2025, at 10:24 AM, Paulo Motta wrote: >>> >>> > I think requiring a rebuild is a deal breaker for most teams. In >>> most instances it would be having to also expand the cluster to handle the >>> additional disk requirements. It turns an inconsistency problem into a >>> major operational headache that can take weeks to resolve. >>> >>> Agreed. The rebuild would not be required during normal operations when >>> the cluster is properly maintained (ie. regular repair) - only in >>> catastrophic situations. This is also the case for ordinary tables >>> currently: if there's data loss, then restoring from a backup is needed. >>> This could be a possible alternative to not require a rebuild in this >>> extraordinary scenario. >>> >>> On Thu, May 15, 2025 at 10:14 AM Jon Haddad <j...@rustyrazorblade.com> >>> wrote: >>> >>> I think requiring a rebuild is a deal breaker for most teams. In most >>> instances it would be having to also expand the cluster to handle the >>> additional disk requirements. It turns an inconsistency problem into a >>> major operational headache that can take weeks to resolve. >>> >>> >>> >>> >>> >>> On Thu, May 15, 2025 at 7:02 AM Paulo Motta <pauloricard...@gmail.com> >>> wrote: >>> >>> > There's bi-directional entropy issues with MV's - either orphaned view >>> data or missing view data; that's why you kind of need a "bi-directional >>> ETL" to make sure the 2 agree with each other. While normal repair would >>> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV >>> that's not in base table anymore" case, which afaict all base consistency >>> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable >>> to. >>> >>> I don't think that bi-directional reconciliation should be a >>> requirement, when the base table is assumed to be the source of truth as >>> stated in the CEP doc. >>> >>> I think the main issue with the current MV implementation is that each >>> view replica is independently replicated by the base replica, before the >>> base write is acknowledged. >>> >>> This creates a correctness issue in the write path, because a view >>> update can be created for a write that was not accepted by the coordinator >>> in the following scenario: >>> >>> N=RF=3 >>> CL=ONE >>> - Update U is propagated to view replica V, coordinator that is also >>> base replica B dies before accepting base table write request to client. 
>>> Now U exists in V but not in B. >>> >>> I think in order to address this, the view should be propagated to the >>> base replicas *after* it's accepted by all or a majority of base replicas. >>> This is where I think mutation tracking could probably help. >>> >>> I think this would ensure that as long as there's no data loss or >>> bit-rot, the base and view can be repaired independently. When there is >>> data loss or bit-rot in either the base table or the view, then it is the >>> same as 2i today: rebuild is required. >>> >>> > It'd be correct (if operationally disappointing) to be able to just >>> say "if you have data loss in your base table you need to rebuild the >>> corresponding MV's", but the problem is operators aren't always going to >>> know when that data loss occurs. Not everything is as visible as a lost >>> quorum of replicas or blown up SSTables. >>> >>> I think there are opportunities to improve rebuild speed, assuming the >>> base table as a source of truth. For example, rebuild only subranges when >>> data-loss is detected. >>> >>> On Thu, May 15, 2025 at 8:07 AM Josh McKenzie <jmcken...@apache.org> >>> wrote: >>> >>> >>> There's bi-directional entropy issues with MV's - either orphaned view >>> data or missing view data; that's why you kind of need a "bi-directional >>> ETL" to make sure the 2 agree with each other. While normal repair would >>> resolve the "missing data in MV" case, it wouldn't resolve the "data in MV >>> that's not in base table anymore" case, which afaict all base consistency >>> approaches (status quo, PaxosV2, Accord, Mutation Tracking) are vulnerable >>> to. >>> >>> It'd be correct (if operationally disappointing) to be able to just say >>> "if you have data loss in your base table you need to rebuild the >>> corresponding MV's", but the problem is operators aren't always going to >>> know when that data loss occurs. Not everything is as visible as a lost >>> quorum of replicas or blown up SSTables. >>> >>> On Wed, May 14, 2025, at 2:38 PM, Blake Eggleston wrote: >>> >>> Maybe, I’m not really familiar enough with how “classic” MV repair works >>> to say. You can’t mix normal repair and mutation reconciliation in the >>> current incarnation of mutation tracking though, so I wouldn’t assume it >>> would work with MVs. >>> >>> On Wed, May 14, 2025, at 11:29 AM, Jon Haddad wrote: >>> >>> In the case of bitrot / losing an SSTable, wouldn't a normal repair >>> (just the MV against the other nodes) resolve the issue? >>> >>> On Wed, May 14, 2025 at 11:27 AM Blake Eggleston <bl...@ultrablake.com> >>> wrote: >>> >>> >>> Mutation tracking is definitely an approach you could take for MVs. >>> Mutation reconciliation could be extended to ensure all changes have been >>> replicated to the views. When a base table received a mutation w/ an id it >>> would generate a view update. If you block marking a given mutation id as >>> reconciled until it’s been fully replicated to the base table and its view >>> updates have been fully replicated to the views, then all view updates will >>> eventually be applied as part of the log reconciliation process. >>> >>> A mutation tracking implementation would also allow you to be more >>> flexible with the types of consistency levels you can work with, allowing >>> users to do things like use LOCAL_QUORUM without leaving themselves open to >>> introducing view inconsistencies. 
>>> >>> That would more or less eliminate the need for any MV repair in normal >>> usage, but wouldn't address how to repair issues caused by bugs or data >>> loss, though you may be able to do something with comparing the latest >>> mutation ids for the base tables and its view ranges. >>> >>> On Wed, May 14, 2025, at 10:19 AM, Paulo Motta wrote: >>> >>> I don't see mutation tracking [1] mentioned in this thread or in the >>> CEP-48 description. Not sure this would fit into the scope of this >>> initial CEP, but I have a feeling that mutation tracking could be >>> potentially helpful to reconcile base tables and views ? >>> >>> For example, when both base and view updates are acknowledged then this >>> could be somehow persisted in the view sstables mutation tracking >>> summary[2] or similar metadata ? Then these updates would be skipped during >>> view repair, considerably reducing the amount of work needed, since only >>> un-acknowledged views updates would need to be reconciled. >>> >>> [1] - >>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking| >>> <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-45%3A+Mutation+Tracking%7C> >>> [2] - <https://issues.apache.org/jira/browse/CASSANDRA-20336> >>> >>>