Hi Benedict, that’s actually not true. Here’s a link to the project governance page: _https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance_
The CEP section says: “*Once the proposal is finalized and any major committer dissent reconciled, call a [VOTE] on the ML to have the proposal adopted. The criteria for acceptance is consensus (3 binding +1 votes and no binding vetoes). The vote should remain open for 72 hours.*”

So they’re definitely vetoable. Also note the part about “*Once the proposal is finalized and any major committer dissent reconciled,*” being a prerequisite for moving a CEP to [VOTE]. Given the as-yet-unreconciled committer dissent, it wouldn’t even be appropriate to move to a VOTE until we get to the bottom of this repair discussion.

On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith wrote:
> > but the snapshot repair design is not a viable path forward. It’s the first iteration of a repair design. We’ve proposed a second iteration, and we’re open to a third iteration.
>
> I shan't be participating further in discussion, but I want to make a point of order. The CEP process has no vetoes, so you are not empowered to declare that a design is not viable without the input of the wider community.
>
> On 2025/06/05 03:58:59 Blake Eggleston wrote:
> > You can detect and fix the mismatch in a single round of repair, but the amount of work needed to do it is _significantly_ higher with snapshot repair. Consider a case where we have a 300-node cluster w/ RF 3, where each view partition contains entries mapping to every token range in the cluster - so 100 ranges. If we lose a view sstable, it will affect an entire row/column of the grid. Repair is going to scan all data in the mismatching view token ranges 100 times, and each base range once. So you’re looking at 200 range scans.
> >
> > Now, you may argue that you can merge the duplicate view scans into a single scan while you repair all token ranges in parallel. I’m skeptical that’s going to be achievable in practice, but even if it is, we’re now talking about the view replica hypothetically doing a pairwise repair with every other replica in the cluster at the same time. Neither of these options is workable.
> >
> > Let’s take a step back though, because I think we’re getting lost in the weeds.
> >
> > The repair design in the CEP has some high-level concepts that make a lot of sense; the idea of repairing a grid is really smart. However, it has some significant drawbacks that remain unaddressed. I want this CEP to succeed, and I know Jon does too, but the snapshot repair design is not a viable path forward. It’s the first iteration of a repair design. We’ve proposed a second iteration, and we’re open to a third iteration. This part of the CEP process is meant to identify and address shortcomings; I don’t think that continuing to dissect the snapshot repair design is making progress in that direction.
> >
> > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
> > > > We potentially have to do it several times on each node, depending on the size of the range. Smaller ranges increase the size of the board exponentially, larger ranges increase the number of SSTables that would be involved in each compaction.
> > >
> > > As described in the CEP example, this can be handled in a single round of repair. We first identify all the points in the grid that require repair, then perform anti-compaction and stream data based on a second scan over those identified points.
> > > This applies to the snapshot-based solution—without an index, repairing a single point in that grid requires scanning the entire base table partition (token range). In contrast, with the index-based solution—as in the example you referenced—if a large block of data is corrupted, even though the index is used for comparison, many key mismatches may occur. This can lead to random disk access to the original data files, which could cause performance issues. For the case you mentioned for the snapshot-based solution, it should not take months to repair all the data; one round of repair should be enough. The actual repair phase is split from the detection phase.
> > >
> > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
> > >> > This isn’t really the whole story. The amount of wasted scans on index repairs is negligible. If a difference is detected with snapshot repairs though, you have to read the entire partition from both the view and base table to calculate what needs to be fixed.
> > >>
> > >> You nailed it.
> > >>
> > >> When the base table is converted to a view, and sent to the view, the information we have is that one of the view's partition keys needs a repair. That's going to be different from the partition key of the base table. As a result, on the base table, for each affected range, we'd have to issue another compaction across the entire set of sstables that could have the data the view needs (potentially many GB) in order to rebuild the corrected version of the partition, then send it over to the view. Without an index in place, we have to do yet another scan per affected range.
> > >>
> > >> Consider the case of a single corrupted SSTable on the view that's removed from the filesystem, or the data is simply missing after being restored from an inconsistent backup. It presumably contains lots of partitions, which map to base partitions all over the cluster, in a lot of different token ranges. For every one of those ranges (hundreds to tens of thousands of them, given the checkerboard design), when finding the missing data in the base, you'll have to perform a compaction across all the SSTables that potentially contain the missing data just to rebuild the view-oriented partitions that need to be sent to the view. The complexity of this operation can be looked at as O(N*M), where N and M are the number of ranges in the base table and the view affected by the corruption, respectively. Without an index in place, finding the missing data is very expensive. We potentially have to do it several times on each node, depending on the size of the range. Smaller ranges increase the size of the board exponentially, larger ranges increase the number of SSTables that would be involved in each compaction.
> > >>
> > >> Then you send that data over to the view, and the view does its anti-compaction thing, again once per affected range. So now the view has to do an anti-compaction once per block on the board that's affected by the missing data.
> > >>
> > >> Doing hundreds or thousands of these will add up pretty quickly.
> > >>
> > >> When I said that a repair could take months, this is what I had in mind.
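(A rough back-of-envelope sketch of the scan and compaction counts discussed above, using the thread's illustrative example of a 300-node cluster, RF 3, and ~100 grid ranges touched by one lost view SSTable. The function names and inputs are purely hypothetical, not part of the CEP.)

    # Back-of-envelope arithmetic for the lost-SSTable scenario discussed above.
    # All inputs are assumed figures from the thread's example, not measurements.

    def snapshot_repair_range_scans(grid_ranges: int) -> int:
        """Losing one view SSTable dirties a whole row/column of the repair grid:
        the mismatching view range is re-scanned once per base range it is paired
        with, and each affected base range is scanned once."""
        view_scans = grid_ranges  # same view data scanned once per grid block
        base_scans = grid_ranges  # each affected base range scanned once
        return view_scans + base_scans

    def base_side_compactions(affected_base_ranges: int, affected_view_ranges: int) -> int:
        """The O(N*M) concern: without an index, rebuilding the view-oriented
        partitions can require a compaction per (base range, view range) pair."""
        return affected_base_ranges * affected_view_ranges

    # Thread example: 300-node cluster, RF 3, ~100 ranges per view partition.
    print(snapshot_repair_range_scans(100))    # -> 200 range scans
    print(base_side_compactions(100, 100))     # -> 10,000 in the worst case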
> > >>
> > >> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston <bl...@ultrablake.com> wrote:
> > >>> > Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction.
> > >>>
> > >>> I’d make the same argument about the overhead of maintaining the index that Jon just made about the disk space required. The relatively predictable overhead of maintaining the index as part of the write and compaction paths is a pro, not a con. Although you’re not always paying the cost of building a merkle tree with snapshot repair, it can impact the hot path and you do have to plan for it.
> > >>>
> > >>> > Verifies index content, not actual data—may miss low-probability errors like bit flips
> > >>>
> > >>> Presumably this could be handled by the views performing repair against each other? You could also periodically rebuild the index or perform checksums against the sstable content.
> > >>>
> > >>> > Extra data scan during inconsistency detection
> > >>> > Index: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted.
> > >>> > Snapshots: No extra data scan
> > >>>
> > >>> This isn’t really the whole story. The amount of wasted scans on index repairs is negligible. If a difference is detected with snapshot repairs though, you have to read the entire partition from both the view and base table to calculate what needs to be fixed.
> > >>>
> > >>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
> > >>>> One practical aspect that isn't immediately obvious is the disk space consideration for snapshots.
> > >>>>
> > >>>> When you have a table with a mixed workload using LCS or UCS with scaling parameters like L10 and initiate a repair, the disk usage will increase as long as the snapshot persists and the table continues to receive writes. This aspect is understood and factored into the design.
> > >>>>
> > >>>> However, a more nuanced point is the necessity to maintain sufficient disk headroom specifically for running repairs. This echoes the challenge with STCS compaction, where enough space must be available to accommodate the largest SSTables, even when they are not being actively compacted.
> > >>>>
> > >>>> For example, if a repair involves rewriting 100GB of SSTable data, you'll consistently need to reserve 100GB of free space to facilitate this.
> > >>>>
> > >>>> Therefore, while the snapshot-based approach leads to variable disk space utilization, operators must provision storage as if the maximum potential space will be used at all times to ensure repairs can be executed.
> > >>>>
> > >>>> This introduces a rate-of-churn dynamic, where the write throughput dictates the required extra disk space, rather than the existing on-disk data volume.
> > >>>>
> > >>>> If 50% of your SSTables are rewritten during a snapshot, you would need 50% free disk space. Depending on the workload, the snapshot method could consume significantly more disk space than an index-based approach. Conversely, for relatively static workloads, the index method might require more space.
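(A minimal sketch of the churn-driven headroom arithmetic described above; the churn rate and snapshot lifetime below are hypothetical numbers chosen only to reproduce the 100GB example, not figures from the thread.)

    # Rough illustration of the headroom requirement for snapshot-based repair.
    # Inputs are assumed numbers, not measurements.

    def snapshot_headroom_gb(rewrite_rate_gb_per_hour: float,
                             snapshot_lifetime_hours: float) -> float:
        """While a repair snapshot pins SSTables, data rewritten by writes and
        compaction exists twice on disk, so free space must cover the churned volume."""
        return rewrite_rate_gb_per_hour * snapshot_lifetime_hours

    # e.g. ~5 GB/h of rewritten SSTable data with a snapshot held for 20 hours
    # implies ~100 GB of disk headroom reserved whenever a repair may run.
    print(snapshot_headroom_gb(5.0, 20.0))  # -> 100.0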
> > >>>> It's not as straightforward as stating "No extra disk space needed".
> > >>>>
> > >>>> Jon
> > >>>>
> > >>>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu <curly...@gmail.com> wrote:
> > >>>>> > Regarding your comparison between approaches, I think you also need to take into account the other dimensions that have been brought up in this thread. Things like minimum repair times and vulnerability to outages and topology changes are the first that come to mind.
> > >>>>>
> > >>>>> Sure, I added a few more points.
> > >>>>>
> > >>>>> 1. Hot path overhead
> > >>>>> *Index-Based Solution*: Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction.
> > >>>>> *Snapshot-Based Solution*: No impact on the hot path.
> > >>>>>
> > >>>>> 2. Extra disk usage when repair is not running
> > >>>>> *Index-Based Solution*: Requires additional disk space to store persistent indexes.
> > >>>>> *Snapshot-Based Solution*: No extra disk space needed.
> > >>>>>
> > >>>>> 3. Extra disk usage during repair
> > >>>>> *Index-Based Solution*: Minimal or no additional disk usage.
> > >>>>> *Snapshot-Based Solution*: Requires additional disk space for snapshots.
> > >>>>>
> > >>>>> 4. Fine-grained repair to deal with emergency situations / topology changes
> > >>>>> *Index-Based Solution*: Supports fine-grained repairs by targeting specific index ranges. This allows repair to be retried on smaller data sets, enabling incremental progress when repairing the entire table. This is especially helpful when there are down nodes or topology changes during repair, which are common in day-to-day operations.
> > >>>>> *Snapshot-Based Solution*: Coordination across all nodes is required over a long period of time. For each round of repair, if all replica nodes are down or if there is a topology change, the data ranges that were not covered will need to be repaired in the next round.
> > >>>>>
> > >>>>> 5. Validating data used in reads directly
> > >>>>> *Index-Based Solution*: Verifies index content, not actual data—may miss low-probability errors like bit flips.
> > >>>>> *Snapshot-Based Solution*: Verifies actual data content, providing stronger correctness guarantees.
> > >>>>>
> > >>>>> 6. Extra data scan during inconsistency detection
> > >>>>> *Index-Based Solution*: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted.
> > >>>>> *Snapshot-Based Solution*: No extra data scan.
> > >>>>>
> > >>>>> 7. The overhead of actual data repair after an inconsistency is detected
> > >>>>> *Index-Based Solution*: Only indexes are streamed to the base table node, and the actual data being fixed can be as accurate as the row level.
> > >>>>> *Snapshot-Based Solution*: Anti-compaction is needed on the MV table, and potential over-streaming may occur due to the lack of row-level insight into data quality.
> > >>>>>
> > >>>>> > one of my biggest concerns I haven't seen discussed much is LOCAL_SERIAL/SERIAL on read
> > >>>>>
> > >>>>> Paxos v2 introduces an optimization where serial reads can be completed in just one round trip, reducing latency compared to traditional Paxos, which may require multiple phases.
> > >>>>>
> > >>>>> > I think a refresh would be low-cost and give users the flexibility to run them however they want.
> > >>>>>
> > >>>>> I think this is an interesting idea. Does it suggest that the MV should be rebuilt on a regular schedule? It sounds like an extension of the snapshot-based approach—rather than detecting mismatches, we would periodically reconstruct a clean version of the MV based on the snapshot. This seems to diverge from the current MV model in Cassandra, where consistency between the MV and base table must be maintained continuously. This could be an extension of the CEP-48 work, where the MV is periodically rebuilt from a snapshot of the base table, assuming the user can tolerate some level of staleness in the MV data.