Hi Benedict, that’s actually not true. Here’s a link to the project governance page: _https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance_
The CEP section says: “*Once the proposal is finalized and any major committer dissent reconciled, call a [VOTE] on the ML to have the proposal adopted. The criteria for acceptance is consensus (3 binding +1 votes and no binding vetoes). The vote should remain open for 72 hours.*”

So they’re definitely vetoable. Also note the part about “*Once the proposal is finalized and any major committer dissent reconciled,*” being a prerequisite for moving a CEP to [VOTE]. Given the as-yet-unreconciled committer dissent, it wouldn’t even be appropriate to move to a VOTE until we get to the bottom of this repair discussion.

On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith wrote:
> > but the snapshot repair design is not a viable path forward. It’s the first iteration of a repair design. We’ve proposed a second iteration, and we’re open to a third iteration.
>
> I shan't be participating further in discussion, but I want to make a point of order. The CEP process has no vetoes, so you are not empowered to declare that a design is not viable without the input of the wider community.
>
> On 2025/06/05 03:58:59 Blake Eggleston wrote:
> > You can detect and fix the mismatch in a single round of repair, but the amount of work needed to do it is _significantly_ higher with snapshot repair. Consider a case where we have a 300-node cluster w/ RF 3, where each view partition contains entries mapping to every token range in the cluster - so 100 ranges. If we lose a view sstable, it will affect an entire row/column of the grid. Repair is going to scan all data in the mismatching view token ranges 100 times, and each base range once. So you’re looking at 200 range scans.
> >
> > Now, you may argue that you can merge the duplicate view scans into a single scan while you repair all token ranges in parallel. I’m skeptical that’s going to be achievable in practice, but even if it is, we’re now talking about the view replica hypothetically doing a pairwise repair with every other replica in the cluster at the same time. Neither of these options is workable.
> >
> > Let’s take a step back though, because I think we’re getting lost in the weeds.
> >
> > The repair design in the CEP has some high-level concepts that make a lot of sense; the idea of repairing a grid is really smart. However, it has some significant drawbacks that remain unaddressed. I want this CEP to succeed, and I know Jon does too, but the snapshot repair design is not a viable path forward. It’s the first iteration of a repair design. We’ve proposed a second iteration, and we’re open to a third iteration. This part of the CEP process is meant to identify and address shortcomings; I don’t think that continuing to dissect the snapshot repair design is making progress in that direction.
> >
> > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
> > > > We potentially have to do it several times on each node, depending on the size of the range. Smaller ranges increase the size of the board exponentially, larger ranges increase the number of SSTables that would be involved in each compaction.
> > >
> > > As described in the CEP example, this can be handled in a single round of repair. We first identify all the points in the grid that require repair, then perform anti-compaction and stream data based on a second scan over those identified points.
> > > This applies to the snapshot-based solution—without an index, repairing a single point in that grid requires scanning the entire base table partition (token range). In contrast, with the index-based solution—as in the example you referenced—if a large block of data is corrupted, even though the index is used for comparison, many key mismatches may occur. This can lead to random disk access to the original data files, which could cause performance issues. For the case you mentioned for the snapshot-based solution, it should not take months to repair all the data; one round of repair should be enough. The actual repair phase is split from the detection phase.
> > >
> > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
> > >> > This isn’t really the whole story. The amount of wasted scans on index repairs is negligible. If a difference is detected with snapshot repairs though, you have to read the entire partition from both the view and base table to calculate what needs to be fixed.
> > >>
> > >> You nailed it.
> > >>
> > >> When the base table is converted to a view, and sent to the view, the information we have is that one of the view's partition keys needs a repair. That's going to be different from the partition key of the base table. As a result, on the base table, for each affected range, we'd have to issue another compaction across the entire set of sstables that could have the data the view needs (potentially many GB) in order to rebuild the corrected version of the partition, then send it over to the view. Without an index in place, we have to do yet another scan per affected range.
> > >>
> > >> Consider the case of a single corrupted SSTable on the view that's removed from the filesystem, or the data is simply missing after being restored from an inconsistent backup. It presumably contains lots of partitions, which map to base partitions all over the cluster, in a lot of different token ranges. For every one of those ranges (hundreds to tens of thousands of them, given the checkerboard design), when finding the missing data in the base, you'll have to perform a compaction across all the SSTables that potentially contain the missing data just to rebuild the view-oriented partitions that need to be sent to the view. The complexity of this operation can be looked at as O(N*M), where N and M are the number of ranges in the base table and the view affected by the corruption, respectively. Without an index in place, finding the missing data is very expensive. We potentially have to do it several times on each node, depending on the size of the range. Smaller ranges increase the size of the board exponentially, larger ranges increase the number of SSTables that would be involved in each compaction.
> > >>
> > >> Then you send that data over to the view, and the view does its anti-compaction thing, again once per affected range. So now the view has to do an anti-compaction once per block on the board that's affected by the missing data.
> > >>
> > >> Doing hundreds or thousands of these will add up pretty quickly.
> > >>
> > >> When I said that a repair could take months, this is what I had in mind.
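(A rough back-of-envelope sketch of the scan and compaction counts discussed above, using the thread's illustrative example of a 300-node cluster, RF 3, and ~100 grid ranges touched by one lost view SSTable. The function names and inputs are purely hypothetical, not part of the CEP.)

    # Back-of-envelope arithmetic for the lost-SSTable scenario discussed above.
    # All inputs are assumed figures from the thread's example, not measurements.

    def snapshot_repair_range_scans(grid_ranges: int) -> int:
        """Losing one view SSTable dirties a whole row/column of the repair grid:
        the mismatching view range is re-scanned once per base range it is paired
        with, and each affected base range is scanned once."""
        view_scans = grid_ranges  # same view data scanned once per grid block
        base_scans = grid_ranges  # each affected base range scanned once
        return view_scans + base_scans

    def base_side_compactions(affected_base_ranges: int, affected_view_ranges: int) -> int:
        """The O(N*M) concern: without an index, rebuilding the view-oriented
        partitions can require a compaction per (base range, view range) pair."""
        return affected_base_ranges * affected_view_ranges

    # Thread example: 300-node cluster, RF 3, ~100 ranges per view partition.
    print(snapshot_repair_range_scans(100))    # -> 200 range scans
    print(base_side_compactions(100, 100))     # -> 10,000 in the worst case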
> > >>
> > >> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston <bl...@ultrablake.com> wrote:
> > >>> > Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction.
> > >>>
> > >>> I’d make the same argument about the overhead of maintaining the index that Jon just made about the disk space required. The relatively predictable overhead of maintaining the index as part of the write and compaction paths is a pro, not a con. Although you’re not always paying the cost of building a merkle tree with snapshot repair, it can impact the hot path and you do have to plan for it.
> > >>>
> > >>> > Verifies index content, not actual data—may miss low-probability errors like bit flips
> > >>>
> > >>> Presumably this could be handled by the views performing repair against each other? You could also periodically rebuild the index or perform checksums against the sstable content.
> > >>>
> > >>> > Extra data scan during inconsistency detection
> > >>> > Index: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted.
> > >>> > Snapshots: No extra data scan
> > >>>
> > >>> This isn’t really the whole story. The amount of wasted scans on index repairs is negligible. If a difference is detected with snapshot repairs though, you have to read the entire partition from both the view and base table to calculate what needs to be fixed.
> > >>>
> > >>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
> > >>>> One practical aspect that isn't immediately obvious is the disk space consideration for snapshots.
> > >>>>
> > >>>> When you have a table with a mixed workload using LCS or UCS with scaling parameters like L10 and initiate a repair, the disk usage will increase as long as the snapshot persists and the table continues to receive writes. This aspect is understood and factored into the design.
> > >>>>
> > >>>> However, a more nuanced point is the necessity to maintain sufficient disk headroom specifically for running repairs. This echoes the challenge with STCS compaction, where enough space must be available to accommodate the largest SSTables, even when they are not being actively compacted.
> > >>>>
> > >>>> For example, if a repair involves rewriting 100GB of SSTable data, you'll consistently need to reserve 100GB of free space to facilitate this.
> > >>>>
> > >>>> Therefore, while the snapshot-based approach leads to variable disk space utilization, operators must provision storage as if the maximum potential space will be used at all times to ensure repairs can be executed.
> > >>>>
> > >>>> This introduces a rate-of-churn dynamic, where the write throughput dictates the required extra disk space, rather than the existing on-disk data volume.
> > >>>>
> > >>>> If 50% of your SSTables are rewritten during a snapshot, you would need 50% free disk space. Depending on the workload, the snapshot method could consume significantly more disk space than an index-based approach. Conversely, for relatively static workloads, the index method might require more space.
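(A minimal sketch of the churn-driven headroom arithmetic described above; the churn rate and snapshot lifetime below are hypothetical numbers chosen only to reproduce the 100GB example, not figures from the thread.)

    # Rough illustration of the headroom requirement for snapshot-based repair.
    # Inputs are assumed numbers, not measurements.

    def snapshot_headroom_gb(rewrite_rate_gb_per_hour: float,
                             snapshot_lifetime_hours: float) -> float:
        """While a repair snapshot pins SSTables, data rewritten by writes and
        compaction exists twice on disk, so free space must cover the churned volume."""
        return rewrite_rate_gb_per_hour * snapshot_lifetime_hours

    # e.g. ~5 GB/h of rewritten SSTable data with a snapshot held for 20 hours
    # implies ~100 GB of disk headroom reserved whenever a repair may run.
    print(snapshot_headroom_gb(5.0, 20.0))  # -> 100.0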
> > >>>> It's not as straightforward as stating "No extra disk space needed".
> > >>>>
> > >>>> Jon
> > >>>>
> > >>>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu <curly...@gmail.com> wrote:
> > >>>>> > Regarding your comparison between approaches, I think you also need to take into account the other dimensions that have been brought up in this thread. Things like minimum repair times and vulnerability to outages and topology changes are the first that come to mind.
> > >>>>>
> > >>>>> Sure, I added a few more points.
> > >>>>>
> > >>>>> 1. Hot path overhead
> > >>>>> *Index-Based Solution*: Adds overhead in the hot path due to maintaining indexes. Extra memory needed during write path and compaction.
> > >>>>> *Snapshot-Based Solution*: No impact on the hot path.
> > >>>>>
> > >>>>> 2. Extra disk usage when repair is not running
> > >>>>> *Index-Based Solution*: Requires additional disk space to store persistent indexes.
> > >>>>> *Snapshot-Based Solution*: No extra disk space needed.
> > >>>>>
> > >>>>> 3. Extra disk usage during repair
> > >>>>> *Index-Based Solution*: Minimal or no additional disk usage.
> > >>>>> *Snapshot-Based Solution*: Requires additional disk space for snapshots.
> > >>>>>
> > >>>>> 4. Fine-grained repair to deal with emergency situations / topology changes
> > >>>>> *Index-Based Solution*: Supports fine-grained repairs by targeting specific index ranges. This allows repair to be retried on smaller data sets, enabling incremental progress when repairing the entire table. This is especially helpful when there are down nodes or topology changes during repair, which are common in day-to-day operations.
> > >>>>> *Snapshot-Based Solution*: Coordination across all nodes is required over a long period of time. For each round of repair, if all replica nodes are down or if there is a topology change, the data ranges that were not covered will need to be repaired in the next round.
> > >>>>>
> > >>>>> 5. Validating data used in reads directly
> > >>>>> *Index-Based Solution*: Verifies index content, not actual data—may miss low-probability errors like bit flips.
> > >>>>> *Snapshot-Based Solution*: Verifies actual data content, providing stronger correctness guarantees.
> > >>>>>
> > >>>>> 6. Extra data scan during inconsistency detection
> > >>>>> *Index-Based Solution*: Since the data covered by certain indexes is not guaranteed to be fully contained within a single node as the topology changes, some data scans may be wasted.
> > >>>>> *Snapshot-Based Solution*: No extra data scan.
> > >>>>>
> > >>>>> 7. The overhead of actual data repair after an inconsistency is detected
> > >>>>> *Index-Based Solution*: Only indexes are streamed to the base table node, and the actual data being fixed can be as accurate as the row level.
> > >>>>> *Snapshot-Based Solution*: Anti-compaction is needed on the MV table, and potential over-streaming may occur due to the lack of row-level insight into data quality.
> > >>>>>
> > >>>>> > one of my biggest concerns I haven't seen discussed much is LOCAL_SERIAL/SERIAL on read
> > >>>>>
> > >>>>> Paxos v2 introduces an optimization where serial reads can be completed in just one round trip, reducing latency compared to traditional Paxos, which may require multiple phases.
> > >>>>>
> > >>>>> > I think a refresh would be low-cost and give users the flexibility to run them however they want.
> > >>>>>
> > >>>>> I think this is an interesting idea. Does it suggest that the MV should be rebuilt on a regular schedule? It sounds like an extension of the snapshot-based approach—rather than detecting mismatches, we would periodically reconstruct a clean version of the MV based on the snapshot. This seems to diverge from the current MV model in Cassandra, where consistency between the MV and base table must be maintained continuously. This could be an extension of the CEP-48 work, where the MV is periodically rebuilt from a snapshot of the base table, assuming the user can tolerate some level of staleness in the MV data.