Okay, let’s put the efficiency discussion on hold for now. I want to make sure the actual repair process after detecting inconsistencies will work with the index-based solution.
When a mismatch is detected, the MV replica will need to stream its index file to the base table replica. The base table replica will then compare the two files. There are three cases we need to handle:

1. Missing row in MV: this is straightforward; we can propagate the data to the MV.
2. Extra row in MV (assuming the tombstone is gone in the base table): how should we fix this?
3. Inconsistency (timestamps don't match): it's easy to fix when the base table has higher timestamps, but how do we resolve it when the MV columns have higher timestamps?

Do we need to introduce a new kind of tombstone to shadow the rows in the MV for cases 2 and 3? If yes, how will this tombstone work? If no, how should we fix the MV data? (A rough sketch of this comparison step is included below.)

On Mon, Jun 9, 2025 at 11:00 AM Blake Eggleston <bl...@ultrablake.com> wrote: > > hopefully we can come up with a solution that everyone agrees on. > > I'm sure we can, I think we've been making good progress > > > My main concern with the index-based solution is the overhead it adds to > the hot path, as well as having to build indexes periodically. > > So the additional overhead of maintaining a storage attached index on the > client write path is pretty minimal - it's basically adding data to an in > memory trie. It's a little extra work and memory usage, but there isn't any > extra io or other blocking associated with it. I'd expect the latency > impact to be negligible. > > > As mentioned earlier, this MV repair should be an infrequent operation > > I don't think that's a safe assumption. There are a lot of situations > outside of data loss bugs where repair would need to be run. > > These use cases could probably be handled by repairing the view with other > view replicas: > > Scrubbing corrupt sstables > Node replacement via backup > > These use cases would need an actual MV repair to check consistency with > the base table: > > Restoring a cluster from a backup > Imported sstables via nodetool import > Data loss from operator error > Proactive consistency checks - ie preview repairs > > Even if it is an infrequent operation, when operators need it, it needs to > be available and reliable. > > It's a fact that there are clusters where non-incremental repairs are run > on a cadence of a week or more to manage the overhead of validation > compactions. Assuming the cluster doesn't have any additional headroom, > that would mean that any one of the above events could cause views to > remain out of sync for up to a week while the full set of merkle trees is > being built. > > This delay eliminates a lot of the value of repair as a risk mitigation > tool. If I had to make a recommendation where a bad call could cost me my > job, the prospect of a 7 day delay on repair would mean a strong no. > > Some users also run preview repair continuously to detect data consistency > errors, so at least a subset of users will probably be running MV repairs > continuously - at least in preview mode. > > That's why I say that the replication path should be designed to never > need repair, and MV repair should be designed to be prepared for the worst. > > > I'm wondering if it's possible to enable or disable index building > dynamically so that we don't always incur the cost for something that's > rarely needed. > > I think this would be a really reasonable compromise as long as the > default is on. That way it's as safe as possible by default, but users who > don't care or have a separate system for repairing MVs can opt out.
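To make the comparison step described at the top of this message concrete, here is a minimal sketch of how the base replica might classify each key once both sides are reduced to a map of view primary key to write timestamp. The class, key representation, and method names are assumptions made for the illustration; this is not code from the CEP or from Cassandra.

```java
import java.util.Map;
import java.util.TreeMap;

public class MvIndexDiff
{
    enum Mismatch { MISSING_IN_VIEW, EXTRA_IN_VIEW, BASE_NEWER, VIEW_NEWER }

    // Classify every key into one of the three cases discussed above, assuming the
    // streamed view index and the base replica's state can both be reduced to
    // (view primary key -> write timestamp). Real data would be richer than this.
    static Map<String, Mismatch> diff(Map<String, Long> viewIndex, Map<String, Long> baseState)
    {
        Map<String, Mismatch> result = new TreeMap<>();
        for (Map.Entry<String, Long> e : baseState.entrySet())
        {
            Long viewTs = viewIndex.get(e.getKey());
            if (viewTs == null)
                result.put(e.getKey(), Mismatch.MISSING_IN_VIEW); // case 1: propagate the row to the MV
            else if (viewTs < e.getValue())
                result.put(e.getKey(), Mismatch.BASE_NEWER);      // case 3: base wins, re-apply the update
            else if (viewTs > e.getValue())
                result.put(e.getKey(), Mismatch.VIEW_NEWER);      // case 3: the open question above
        }
        for (String key : viewIndex.keySet())
            if (!baseState.containsKey(key))
                result.put(key, Mismatch.EXTRA_IN_VIEW);          // case 2: the open question above
        return result;
    }

    public static void main(String[] args)
    {
        Map<String, Long> view = Map.of("a", 10L, "b", 20L, "d", 5L);
        Map<String, Long> base = Map.of("a", 10L, "b", 30L, "c", 7L);
        // prints: b -> BASE_NEWER, c -> MISSING_IN_VIEW, d -> EXTRA_IN_VIEW
        diff(view, base).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

However the cases end up being resolved, the MISSING_IN_VIEW and BASE_NEWER buckets can be fixed by replaying base data, while EXTRA_IN_VIEW and VIEW_NEWER are the ones that appear to need either a new shadowing tombstone or some other mechanism.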
> > > I'm not sure what you mean by "data problems" here. > > I mean out of sync views - either due to bugs, operator error, corruption, > etc > > > Also, this does scale with cluster size - I've compared it to full repair, > and this MV repair should behave similarly. That means as long as full > repair works, this repair should work as well. > > You could build the merkle trees at about the same cost as a full repair, > but the actual data repair path is completely different for MV, and that's > the part that doesn't scale well. As you know, with normal repair, we just > stream data for ranges detected as out of sync. For MVs, since the data > isn't in base partition order, the view data for an out of sync view range > needs to be read out and streamed to every base replica that it's detected > a mismatch against. So in the example I gave with the 300 node cluster, > you're looking at reading and transmitting the same partition at least 100 > times in the best case, and the cost of this keeps going up as the cluster > increases in size. That's the part that doesn't scale well. > > This is also one of the benefits of the index design. Since it stores data in > segments that roughly correspond to points on the grid, you're not > rereading the same data over and over. A repair for a given grid point only > reads an amount of data proportional to the data in common for the > base/view grid point, and it's stored in a small enough granularity that > the base can calculate what data needs to be sent to the view without > having to read the entire view partition. > > On Sat, Jun 7, 2025, at 7:42 PM, Runtian Liu wrote: > > Thanks, Blake. I'm open to iterating on the design, and hopefully we can > come up with a solution that everyone agrees on. > > My main concern with the index-based solution is the overhead it adds to > the hot path, as well as having to build indexes periodically. As mentioned > earlier, this MV repair should be an infrequent operation, but the > index-based approach shifts some of the work to the hot path in order to > allow repairs that touch only a few nodes. > > I'm wondering if it's possible to enable or disable index building > dynamically so that we don't always incur the cost for something that's > rarely needed. > > > it degrades operators' ability to react to data problems by imposing a > significant upfront processing burden on repair, and that it doesn't scale > well with cluster size > > I'm not sure what you mean by "data problems" here. Also, this does scale > with cluster size - I've compared it to full repair, and this MV repair > should behave similarly. That means as long as full repair works, this > repair should work as well. > > For example, regardless of how large the cluster is, you can always enable > Merkle tree building on 10% of the nodes at a time until all the trees are > ready. > > I understand that coordinating this type of repair is harder than what we > currently support, but with CEP-37, we should be able to handle this > coordination without adding too much burden on the operator side. > > On Sat, Jun 7, 2025 at 8:28 AM Blake Eggleston <bl...@ultrablake.com> > wrote: > > > I don't see any outcome here that is good for the community though. Either > Runtian caves and adopts your design that he (and I) consider inferior, or > he is prevented from contributing this work. > > > Hey Runtian, fwiw, these aren't the only 2 options. This isn't a > competition. We can collaborate and figure out the best approach to the > problem.
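For reference, the arithmetic behind the 300-node example Blake refers to above, written out as a tiny sketch. The 300-node / RF 3 / ~100-range figures come from his example; everything else is a simplification for illustration, not a measurement.

```java
public class SnapshotRepairScanCount
{
    public static void main(String[] args)
    {
        int nodes = 300;
        int rf = 3;
        int baseRanges = nodes / rf; // ~100 distinct base token ranges feeding each view partition

        // Snapshot repair: losing one view sstable dirties a full row/column of the grid,
        // so the mismatching view data is read and streamed once per mismatching base
        // range, while each base range is scanned once.
        int viewRangeScans = baseRanges;
        int baseRangeScans = baseRanges;
        System.out.printf("snapshot repair: ~%d view scans + %d base scans = %d range scans%n",
                          viewRangeScans, baseRangeScans, viewRangeScans + baseRangeScans);

        // Index repair: each grid point only reads data proportional to what the base and
        // view have in common for that point, so the same view partition is not reread
        // once per base range.
    }
}
```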
I'd like to keep discussing it if you're open to iterating on the > design. > > I'm not married to our proposal, it's just the cleanest way we could think > of to address what Jon and I both see as blockers in the current proposal. > It's not set in stone though. > > On Fri, Jun 6, 2025, at 1:32 PM, Benedict Elliott Smith wrote: > > Hmm, I am very surprised as I helped write that and I distinctly recall a > specific goal was avoiding binding vetoes as they're so toxic. > > Ok, I guess you can block this work if you like. > > I don't see any outcome here that is good for the community though. Either > Runtian caves and adopts your design that he (and I) consider inferior, or > he is prevented from contributing this work. That isn't a functioning > community in my mind, so I'll be noping out for a while, as I don't see > much value here right now. > > > On 2025/06/06 18:31:08 Blake Eggleston wrote: > > Hi Benedict, that's actually not true. > > > > Here's a link to the project governance page: https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance > > > > The CEP section says: > > > > "*Once the proposal is finalized and any major committer dissent > reconciled, call a [VOTE] on the ML to have the proposal adopted. The > criteria for acceptance is consensus (3 binding +1 votes and no binding > vetoes). The vote should remain open for 72 hours.*" > > > > So they're definitely vetoable. > > > > Also note the part about "*Once the proposal is finalized and any major > committer dissent reconciled,*" being a prerequisite for moving a CEP to > [VOTE]. Given the as yet unreconciled committer dissent, it wouldn't even > be appropriate to move to a VOTE until we get to the bottom of this repair > discussion. > > > > On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith wrote: > > > > but the snapshot repair design is not a viable path forward. It's > the first iteration of a repair design. We've proposed a second iteration, > and we're open to a third iteration. > > > > > > I shan't be participating further in discussion, but I want to make a > point of order. The CEP process has no vetoes, so you are not empowered to > declare that a design is not viable without the input of the wider > community. > > > > > > > > > On 2025/06/05 03:58:59 Blake Eggleston wrote: > > > > You can detect and fix the mismatch in a single round of repair, but > the amount of work needed to do it is _significantly_ higher with snapshot > repair. Consider a case where we have a 300 node cluster w/ RF 3, where > each view partition contains entries mapping to every token range in the > cluster - so 100 ranges. If we lose a view sstable, it will affect an > entire row/column of the grid. Repair is going to scan all data in the > mismatching view token ranges 100 times, and each base range once. So > you're looking at 200 range scans. > > > > > > > > Now, you may argue that you can merge the duplicate view scans into > a single scan while you repair all token ranges in parallel. I'm skeptical > that's going to be achievable in practice, but even if it is, we're now > talking about the view replica hypothetically doing a pairwise repair with > every other replica in the cluster at the same time. Neither of these > options is workable. > > > > > > > > Let's take a step back though, because I think we're getting lost in > the weeds. > > > > > > > > The repair design in the CEP has some high level concepts that make > a lot of sense, the idea of repairing a grid is really smart.
However, it > has some significant drawbacks that remain unaddressed. I want this CEP to > succeed, and I know Jon does too, but the snapshot repair design is not a > viable path forward. It's the first iteration of a repair design. We've > proposed a second iteration, and we're open to a third iteration. This part > of the CEP process is meant to identify and address shortcomings; I don't > think that continuing to dissect the snapshot repair design is making > progress in that direction. > > > > > > > > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote: > > > > > > We potentially have to do it several times on each node, > depending on the size of the range. Smaller ranges increase the size of the > board exponentially, larger ranges increase the number of SSTables that > would be involved in each compaction. > > > > > As described in the CEP example, this can be handled in a single > round of repair. We first identify all the points in the grid that require > repair, then perform anti-compaction and stream data based on a second scan > over those identified points. This applies to the snapshot-based > solution - without an index, repairing a single point in that grid requires > scanning the entire base table partition (token range). In contrast, with > the index-based solution - as in the example you referenced - if a large block > of data is corrupted, even though the index is used for comparison, many > key mismatches may occur. This can lead to random disk access to the > original data files, which could cause performance issues. For the case you > mentioned for the snapshot-based solution, it should not take months to repair > all the data; instead, one round of repair should be enough. The actual > repair phase is split from the detection phase. > > > > > > > > > > > > > > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad <j...@rustyrazorblade.com> wrote: > > > > >> > This isn't really the whole story. The amount of wasted scans > on index repairs is negligible. If a difference is detected with snapshot > repairs though, you have to read the entire partition from both the view > and base table to calculate what needs to be fixed. > > > > >> > > > > >> You nailed it. > > > > >> > > > > >> When the base table is converted to a view, and sent to the view, > the information we have is that one of the view's partition keys needs a > repair. That's going to be different from the partition key of the base > table. As a result, on the base table, for each affected range, we'd have > to issue another compaction across the entire set of sstables that could > have the data the view needs (potentially many GB), in order to send over > the corrected version of the partition, then send it over to the view. > Without an index in place, we have to do yet another scan, per-affected > range. > > > > >> > > > > >> Consider the case of a single corrupted SSTable on the view > that's removed from the filesystem, or the data is simply missing after > being restored from an inconsistent backup. It presumably contains lots of > partitions, which map to base partitions all over the cluster, in a lot of > different token ranges. For every one of those ranges (hundreds, to tens > of thousands of them given the checkerboard design), when finding the > missing data in the base, you'll have to perform a compaction across all > the SSTables that potentially contain the missing data just to rebuild the > view-oriented partitions that need to be sent to the view.
The complexity > of this operation can be looked at as O(N*M) where N and M are the number > of ranges in the base table and the view affected by the corruption, > respectively. Without an index in place, finding the missing data is very > expensive. We potentially have to do it several times on each node, > depending on the size of the range. Smaller ranges increase the size of > the board exponentially, larger ranges increase the number of SSTables that > would be involved in each compaction. > > > > >> > > > > >> Then you send that data over to the view, the view does its > anti-compaction thing, again, once per affected range. So now the view has > to do an anti-compaction once per block on the board that's affected by the > missing data. > > > > >> > > > > >> Doing hundreds or thousands of these will add up pretty quickly. > > > > >> > > > > >> When I said that a repair could take months, this is what I had > in mind. > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston <bl...@ultrablake.com> wrote: > > > > >>> > > > > >>> > Adds overhead in the hot path due to maintaining indexes. > Extra memory needed during write path and compaction. > > > > >>> > > > > >>> I'd make the same argument about the overhead of maintaining the > index that Jon just made about the disk space required. The relatively > predictable overhead of maintaining the index as part of the write and > compaction paths is a pro, not a con. Although you're not always paying the > cost of building a merkle tree with snapshot repair, it can impact the hot > path and you do have to plan for it. > > > > >>> > > > > >>> > Verifies index content, not actual data - may miss > low-probability errors like bit flips > > > > >>> > > > > >>> Presumably this could be handled by the views performing repair > against each other? You could also periodically rebuild the index or > perform checksums against the sstable content. > > > > >>> > > > > >>> > Extra data scan during inconsistency detection > > > > >>> > Index: Since the data covered by certain indexes is not > guaranteed to be fully contained within a single node as the topology > changes, some data scans may be wasted. > > > > >>> > Snapshots: No extra data scan > > > > >>> > > > > >>> This isn't really the whole story. The amount of wasted scans on > index repairs is negligible. If a difference is detected with snapshot > repairs though, you have to read the entire partition from both the view > and base table to calculate what needs to be fixed. > > > > >>> > > > > >>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote: > > > > >>>> One practical aspect that isn't immediately obvious is the disk > space consideration for snapshots. > > > > >>>> > > > > >>>> When you have a table with a mixed workload using LCS or UCS > with scaling parameters like L10 and initiate a repair, the disk usage will > increase as long as the snapshot persists and the table continues to > receive writes. This aspect is understood and factored into the design. > > > > >>>> > > > > >>>> However, a more nuanced point is the necessity to maintain > sufficient disk headroom specifically for running repairs. This echoes the > challenge with STCS compaction, where enough space must be available to > accommodate the largest SSTables, even when they are not being actively > compacted.
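As a rough illustration of the write-path cost being weighed in the quoted exchange above (Blake's "adding data to an in-memory trie"), here is a minimal sketch. A sorted map stands in for the real trie, and every name is invented for the example; this is not the SAI or CEP implementation.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical per-memtable structure: on each base write we also record
// (view primary key -> base primary key) in memory, and flush it alongside the
// sstable so repair can later compare per grid point without rereading data.
public class ViewIndexMemtable
{
    private final NavigableMap<String, String> entries = new TreeMap<>();

    // The hot-path cost is a single in-memory insertion: no extra IO, no blocking.
    void onBaseWrite(String baseKey, String viewKey)
    {
        entries.put(viewKey, baseKey);
    }

    // Snapshot taken when the memtable is flushed, written out as an index segment.
    NavigableMap<String, String> flushSegment()
    {
        return new TreeMap<>(entries);
    }

    public static void main(String[] args)
    {
        ViewIndexMemtable index = new ViewIndexMemtable();
        index.onBaseWrite("base:42", "view:a");
        index.onBaseWrite("base:7", "view:b");
        System.out.println(index.flushSegment()); // {view:a=base:42, view:b=base:7}
    }
}
```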
> > > > >>>> > > > > >>>> For example, if a repair involves rewriting 100GB of SSTable > data, you'll consistently need to reserve 100GB of free space to facilitate > this. > > > > >>>> > > > > >>>> Therefore, while the snapshot-based approach leads to variable > disk space utilization, operators must provision storage as if the maximum > potential space will be used at all times to ensure repairs can be executed. > > > > >>>> > > > > >>>> This introduces a rate of churn dynamic, where the write > throughput dictates the required extra disk space, rather than the existing > on-disk data volume. > > > > >>>> > > > > >>>> If 50% of your SSTables are rewritten during a snapshot, you > would need 50% free disk space. Depending on the workload, the snapshot > method could consume significantly more disk space than an index-based > approach. Conversely, for relatively static workloads, the index method > might require more space. It's not as straightforward as stating "No extra > disk space needed". > > > > >>>> > > > > >>>> Jon > > > > >>>> > > > > >>>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu <curly...@gmail.com> > wrote: > > > > >>>>> > Regarding your comparison between approaches, I think you > also need to take into account the other dimensions that have been brought > up in this thread. Things like minimum repair times and vulnerability to > outages and topology changes are the first that come to mind. > > > > >>>>> > > > > >>>>> Sure, I added a few more points.
> > > > >>>>>
> > > > >>>>> *Perspective* | *Index-Based Solution* | *Snapshot-Based Solution*
> > > > >>>>>
> > > > >>>>> 1. Hot path overhead
> > > > >>>>> Index: Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction.
> > > > >>>>> Snapshot: No impact on the hot path
> > > > >>>>>
> > > > >>>>> 2. Extra disk usage when repair is not running
> > > > >>>>> Index: Requires additional disk space to store persistent indexes
> > > > >>>>> Snapshot: No extra disk space needed
> > > > >>>>>
> > > > >>>>> 3. Extra disk usage during repair
> > > > >>>>> Index: Minimal or no additional disk usage
> > > > >>>>> Snapshot: Requires additional disk space for snapshots
> > > > >>>>>
> > > > >>>>> 4. Fine-grained repair to deal with emergency situations / topology changes
> > > > >>>>> Index: Supports fine-grained repairs by targeting specific index ranges. This allows repair to be retried on smaller data sets, enabling incremental progress when repairing the entire table. This is especially helpful when there are down nodes or topology changes during repair, which are common in day-to-day operations.
> > > > >>>>> Snapshot: Coordination across all nodes is required over a long period of time. For each round of repair, if all replica nodes are down or if there is a topology change, the data ranges that were not covered will need to be repaired in the next round.
> > > > >>>>>
> > > > >>>>> 5. Validating data used in reads directly
> > > > >>>>> Index: Verifies index content, not actual data - may miss low-probability errors like bit flips
> > > > >>>>> Snapshot: Verifies actual data content, providing stronger correctness guarantees
> > > > >>>>>
> > > > >>>>> 6. Extra data scan during inconsistency detection
> > > > >>>>> Index: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted.
> > > > >>>>> Snapshot: No extra data scan
> > > > >>>>>
> > > > >>>>> 7. The overhead of actual data repair after an inconsistency is detected
> > > > >>>>> Index: Only indexes are streamed to the base table node, and the actual data being fixed can be as accurate as the row level.
> > > > >>>>> Snapshot: Anti-compaction is needed on the MV table, and potential over-streaming may occur due to the lack of row-level insight into data quality.
> > > > >>>>>
> > > > >>>>> > one of my biggest concerns I haven't seen discussed much is > LOCAL_SERIAL/SERIAL on read > > > > >>>>> > > > > >>>>> Paxos v2 introduces an optimization where serial reads can be > completed in just one round trip, reducing latency compared to traditional > Paxos which may require multiple phases. > > > > >>>>> > > > > >>>>> > I think a refresh would be low-cost and give users the > flexibility to run them however they want. > > > > >>>>> > > > > >>>>> I think this is an interesting idea. Does it suggest that the > MV should be rebuilt on a regular schedule? It sounds like an extension of > the snapshot-based approach - rather than detecting mismatches, we would > periodically reconstruct a clean version of the MV based on the snapshot. > This seems to diverge from the current MV model in Cassandra, where > consistency between the MV and base table must be maintained continuously. > This could be an extension of the CEP-48 work, where the MV is periodically > rebuilt from a snapshot of the base table, assuming the user can tolerate > some level of staleness in the MV data.
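Circling back to the disk-headroom point Jon raised further up the thread: a small sketch of the sizing rule his 100GB / 50% example implies. The rule is a simplification for illustration; actual snapshot growth depends on the compaction strategy and the write rate while the snapshot is held.

```java
public class SnapshotHeadroom
{
    // While a repair snapshot is held, every sstable that gets rewritten keeps its old
    // copy pinned on disk, so free space has to cover the fraction of data expected to
    // be rewritten during the repair window.
    static long requiredHeadroomBytes(long liveBytes, double fractionRewrittenDuringRepair)
    {
        return (long) (liveBytes * fractionRewrittenDuringRepair);
    }

    public static void main(String[] args)
    {
        long liveBytes = 200L * 1024 * 1024 * 1024; // 200 GiB of sstables (illustrative)
        // 50% rewritten while the snapshot exists -> ~100 GiB must stay free
        System.out.println(requiredHeadroomBytes(liveBytes, 0.5) / (1024 * 1024 * 1024) + " GiB");
    }
}
```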