Okay, let’s put the efficiency discussion on hold for now. I want to make
sure the actual repair process after detecting inconsistencies will work
with the index-based solution.

When a mismatch is detected, the MV replica will need to stream its index
file to the base table replica. The base table will then perform a
comparison between the two files.

There are three cases we need to handle:

   1. Missing row in MV — this is straightforward; we can propagate the data
      to the MV.

   2. Extra row in MV (assuming the tombstone is gone in the base table) — how
      should we fix this?

   3. Inconsistency (timestamps don’t match) — it’s easy to fix when the base
      table has higher timestamps, but how do we resolve it when the MV
      columns have higher timestamps?

Do we need to introduce a new kind of tombstone to shadow the rows in the
MV for cases 2 and 3? If yes, how will this tombstone work? If no, how
should we fix the MV data?
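
To make the discussion concrete, here is roughly the comparison loop I
picture on the base replica once both index files are available. All of the
names below (IndexEntry, the handler methods) are placeholders, not proposed
APIs:

    // Placeholder sketch of the base-replica comparison between its own index
    // entries and the index streamed from the MV replica. IndexEntry and the
    // handler methods are illustrative names only.
    import java.util.Map;

    final class IndexEntry
    {
        final long timestamp;   // write timestamp covered by this entry
        IndexEntry(long timestamp) { this.timestamp = timestamp; }
    }

    final class MvIndexDiff
    {
        static void compare(Map<String, IndexEntry> baseIndex, Map<String, IndexEntry> viewIndex)
        {
            for (Map.Entry<String, IndexEntry> e : baseIndex.entrySet())
            {
                IndexEntry view = viewIndex.get(e.getKey());
                if (view == null)
                    propagateToView(e.getKey());                  // case 1: missing row in MV
                else if (view.timestamp < e.getValue().timestamp)
                    propagateToView(e.getKey());                  // case 3: base has the higher timestamp
                else if (view.timestamp > e.getValue().timestamp)
                    resolveNewerView(e.getKey());                 // case 3: MV has the higher timestamp (open question)
            }
            for (String viewKey : viewIndex.keySet())
            {
                if (!baseIndex.containsKey(viewKey))
                    shadowExtraViewRow(viewKey);                  // case 2: extra row in MV (open question)
            }
        }

        static void propagateToView(String key)    { /* stream the base row to the MV */ }
        static void resolveNewerView(String key)   { /* unclear: overwrite? tombstone? */ }
        static void shadowExtraViewRow(String key) { /* unclear: new kind of tombstone? */ }
    }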

On Mon, Jun 9, 2025 at 11:00 AM Blake Eggleston <bl...@ultrablake.com>
wrote:

> > hopefully we can come up with a solution that everyone agrees on.
>
> I’m sure we can, I think we’ve been making good progress
>
> > My main concern with the index-based solution is the overhead it adds to
> the hot path, as well as having to build indexes periodically.
>
> So the additional overhead of maintaining a storage attached index on the
> client write path is pretty minimal - it’s basically adding data to an
> in-memory trie. It’s a little extra work and memory usage, but there isn’t any
> extra io or other blocking associated with it. I’d expect the latency
> impact to be negligible.
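
A minimal sketch of the bookkeeping described above, using made-up names
rather than the actual SAI classes, just to show that the write-path work
amounts to an in-memory insert with no extra IO:

    // Illustrative only: per-write index maintenance modeled as an insert into
    // an in-memory sorted map standing in for SAI's memtable trie. The map is
    // flushed to an on-disk index component when the memtable flushes; the
    // write path itself does no extra IO.
    import java.util.concurrent.ConcurrentSkipListMap;

    final class InMemoryViewIndex
    {
        private final ConcurrentSkipListMap<String, String> viewKeyToBaseKey = new ConcurrentSkipListMap<>();

        void onWrite(String basePartitionKey, String derivedViewKey)
        {
            viewKeyToBaseKey.put(derivedViewKey, basePartitionKey);
        }
    }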
>
> > As mentioned earlier, this MV repair should be an infrequent operation
>
> I don’t think that’s a safe assumption. There are a lot of situations
> outside of data loss bugs where repair would need to be run.
>
> These use cases could probably be handled by repairing the view with other
> view replicas:
>
> Scrubbing corrupt sstables
> Node replacement via backup
>
> These use cases would need an actual MV repair to check consistency with
> the base table:
>
> Restoring a cluster from a backup
> Imported sstables via nodetool import
> Data loss from operator error
> Proactive consistency checks - ie preview repairs
>
> Even if it is an infrequent operation, when operators need it, it needs to
> be available and reliable.
>
> It’s a fact that there are clusters where non-incremental repairs are run
> on a cadence of a week or more to manage the overhead of validation
> compactions. Assuming the cluster doesn’t have any additional headroom,
> that would mean that any one of the above events could cause views to
> remain out of sync for up to a week while the full set of merkle trees is
> being built.
>
> This delay eliminates a lot of the value of repair as a risk mitigation
> tool. If I had to make a recommendation where a bad call could cost me my
> job, the prospect of a 7 day delay on repair would mean a strong no.
>
> Some users also run preview repair continuously to detect data consistency
> errors, so at least a subset of users will probably be running MV repairs
> continuously - at least in preview mode.
>
> That’s why I say that the replication path should be designed to never
> need repair, and MV repair should be designed to be prepared for the worst.
>
> > I’m wondering if it’s possible to enable or disable index building
> dynamically so that we don’t always incur the cost for something that’s
> rarely needed.
>
> I think this would be a really reasonable compromise as long as the
> default is on. That way it’s as safe as possible by default, but users who
> don’t care or have a separate system for repairing MVs can opt out.
>
> > I’m not sure what you mean by “data problems” here.
>
> I mean out of sync views - either due to bugs, operator error, corruption,
> etc
>
> > Also, this does scale with cluster size—I’ve compared it to full repair,
> and this MV repair should behave similarly. That means as long as full
> repair works, this repair should work as well.
>
> You could build the merkle trees at about the same cost as a full repair,
> but the actual data repair path is completely different for MV, and that’s
> the part that doesn’t scale well. As you know, with normal repair, we just
> stream data for ranges detected as out of sync. For MVs, since the data
> isn’t in base partition order, the view data for an out of sync view range
> needs to be read out and streamed to every base replica that it’s detected
> a mismatch against. So in the example I gave with the 300 node cluster,
> you’re looking at reading and transmitting the same partition at least 100
> times in the best case, and the cost of this keeps going up as the cluster
> increases in size. That's the part that doesn't scale well.
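
A back-of-the-envelope illustration of that scaling, using the numbers from
the example above (illustrative only, assuming even range distribution):

    // With snapshot repair, a view partition whose entries span every base
    // range must be re-read and streamed once per base replica group it
    // mismatches against, so the per-partition repair cost grows linearly
    // with cluster size.
    final class ViewRepairScaling
    {
        static int viewPartitionReads(int nodes, int rf)
        {
            int baseRangeGroups = nodes / rf;   // e.g. 300 / 3 = 100
            return baseRangeGroups;             // same view partition read ~100 times
        }

        public static void main(String[] args)
        {
            for (int nodes : new int[] { 30, 300, 3000 })
                System.out.println(nodes + " nodes -> " + viewPartitionReads(nodes, 3) + " reads per view partition");
        }
    }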
>
> This is also one of the benefits of the index design. Since it stores data in
> segments that roughly correspond to points on the grid, you’re not
> rereading the same data over and over. A repair for a given grid point only
> reads an amount of data proportional to the data in common for the
> base/view grid point, and it’s stored in a small enough granularity that
> the base can calculate what data needs to be sent to the view without
> having to read the entire view partition.
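
A rough sketch of the segment idea as described, with hypothetical names and
key format, where index entries are grouped by (base range, view range) so
that repairing one grid point only touches that segment:

    // Hypothetical sketch: index entries grouped into segments keyed by the
    // (base token range, view token range) pair, so a grid-point repair reads
    // an amount of data proportional to what that base/view pair has in
    // common, rather than rescanning whole view partitions.
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;

    final class SegmentedViewIndex
    {
        // key: "<baseRange>|<viewRange>", value: index entries for that grid point
        private final Map<String, List<String>> segments;

        SegmentedViewIndex(Map<String, List<String>> segments) { this.segments = segments; }

        List<String> segmentFor(String baseRange, String viewRange)
        {
            return segments.getOrDefault(baseRange + "|" + viewRange, Collections.emptyList());
        }
    }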
>
> On Sat, Jun 7, 2025, at 7:42 PM, Runtian Liu wrote:
>
> Thanks, Blake. I’m open to iterating on the design, and hopefully we can
> come up with a solution that everyone agrees on.
>
> My main concern with the index-based solution is the overhead it adds to
> the hot path, as well as having to build indexes periodically. As mentioned
> earlier, this MV repair should be an infrequent operation, but the
> index-based approach shifts some of the work to the hot path in order to
> allow repairs that touch only a few nodes.
>
> I’m wondering if it’s possible to enable or disable index building
> dynamically so that we don’t always incur the cost for something that’s
> rarely needed.
>
> > it degrades operators ability to react to data problems by imposing a
> significant upfront processing burden on repair, and that it doesn’t scale
> well with cluster size
>
> I’m not sure what you mean by “data problems” here. Also, this does scale
> with cluster size—I’ve compared it to full repair, and this MV repair
> should behave similarly. That means as long as full repair works, this
> repair should work as well.
>
> For example, regardless of how large the cluster is, you can always enable
> Merkle tree building on 10% of the nodes at a time until all the trees are
> ready.
>
> I understand that coordinating this type of repair is harder than what we
> currently support, but with CEP-37, we should be able to handle this
> coordination without adding too much burden on the operator side.
>
> On Sat, Jun 7, 2025 at 8:28 AM Blake Eggleston <bl...@ultrablake.com>
> wrote:
>
>
> I don't see any outcome here that is good for the community though. Either
> Runtian caves and adopts your design that he (and I) consider inferior, or
> he is prevented from contributing this work.
>
>
> Hey Runtian, fwiw, these aren't the only 2 options. This isn’t a
> competition. We can collaborate and figure out the best approach to the
> problem. I’d like to keep discussing it if you’re open to iterating on the
> design.
>
> I’m not married to our proposal, it’s just the cleanest way we could think
> of to address what Jon and I both see as blockers in the current proposal.
> It’s not set in stone though.
>
> On Fri, Jun 6, 2025, at 1:32 PM, Benedict Elliott Smith wrote:
>
> Hmm, I am very surprised as I helped write that and I distinctly recall a
> specific goal was avoiding binding vetoes as they're so toxic.
>
> Ok, I guess you can block this work if you like.
>
> I don't see any outcome here that is good for the community though. Either
> Runtian caves and adopts your design that he (and I) consider inferior, or
> he is prevented from contributing this work. That isn't a functioning
> community in my mind, so I'll be noping out for a while, as I don't see
> much value here right now.
>
>
> On 2025/06/06 18:31:08 Blake Eggleston wrote:
> > Hi Benedict, that’s actually not true.
> >
> > Here’s a link to the project governance page:
> > https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance
> >
> > The CEP section says:
> >
> > “*Once the proposal is finalized and any major committer dissent
> reconciled, call a [VOTE] on the ML to have the proposal adopted. The
> criteria for acceptance is consensus (3 binding +1 votes and no binding
> vetoes). The vote should remain open for 72 hours.*”
> >
> > So they’re definitely vetoable.
> >
> > Also note the part about “*Once the proposal is finalized and any major
> committer dissent reconciled,*” being a prerequisite for moving a CEP to
> [VOTE]. Given the as yet unreconciled committer dissent, it wouldn’t even
> be appropriate to move to a VOTE until we get to the bottom of this repair
> discussion.
> >
> > On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith wrote:
> > > > but the snapshot repair design is not a viable path forward. It’s
> the first iteration of a repair design. We’ve proposed a second iteration,
> and we’re open to a third iteration.
> > >
> > > I shan't be participating further in discussion, but I want to make a
> point of order. The CEP process has no vetoes, so you are not empowered to
> declare that a design is not viable without the input of the wider
> community.
> > >
> > >
> > > On 2025/06/05 03:58:59 Blake Eggleston wrote:
> > > > You can detect and fix the mismatch in a single round of repair, but
> the amount of work needed to do it is _significantly_ higher with snapshot
> repair. Consider a case where we have a 300 node cluster w/ RF 3, where
> each view partition contains entries mapping to every token range in the
> cluster - so 100 ranges. If we lose a view sstable, it will affect an
> entire row/column of the grid. Repair is going to scan all data in the
> mismatching view token ranges 100 times, and each base range once. So
> you’re looking at 200 range scans.
> > > >
> > > > Now, you may argue that you can merge the duplicate view scans into
> a single scan while you repair all token ranges in parallel. I’m skeptical
> that’s going to be achievable in practice, but even if it is, we’re now
> talking about the view replica hypothetically doing a pairwise repair with
> every other replica in the cluster at the same time. Neither of these
> options is workable.
> > > >
> > > > Let’s take a step back though, because I think we’re getting lost in
> the weeds.
> > > >
> > > > The repair design in the CEP has some high level concepts that make
> a lot of sense, the idea of repairing a grid is really smart. However, it
> has some significant drawbacks that remain unaddressed. I want this CEP to
> succeed, and I know Jon does too, but the snapshot repair design is not a
> viable path forward. It’s the first iteration of a repair design. We’ve
> proposed a second iteration, and we’re open to a third iteration. This part
> of the CEP process is meant to identify and address shortcomings, I don’t
> think that continuing to dissect the snapshot repair design is making
> progress in that direction.
> > > >
> > > > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
> > > > > >  We potentially have to do it several times on each node,
> depending on the size of the range. Smaller ranges increase the size of the
> board exponentially, larger ranges increase the number of SSTables that
> would be involved in each compaction.
> > > > > As described in the CEP example, this can be handled in a single
> round of repair. We first identify all the points in the grid that require
> repair, then perform anti-compaction and stream data based on a second scan
> over those identified points. This applies to the snapshot-based
> solution—without an index, repairing a single point in that grid requires
> scanning the entire base table partition (token range). In contrast, with
> the index-based solution—as in the example you referenced—if a large block
> of data is corrupted, even though the index is used for comparison, many
> key mismatches may occur. This can lead to random disk access to the
> original data files, which could cause performance issues. For the case you
> mentioned for the snapshot-based solution, it should not take months to
> repair all the data; instead, one round of repair should be enough. The
> actual repair phase is split from the detection phase.
> > > > >
> > > > >
> > > > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad <
> j...@rustyrazorblade.com> wrote:
> > > > >> > This isn’t really the whole story. The amount of wasted scans
> on index repairs is negligible. If a difference is detected with snapshot
> repairs though, you have to read the entire partition from both the view
> and base table to calculate what needs to be fixed.
> > > > >>
> > > > >> You nailed it.
> > > > >>
> > > > >> When the base table is converted to a view, and sent to the view,
> the information we have is that one of the view's partition keys needs a
> repair.  That's going to be different from the partition key of the base
> table.  As a result, on the base table, for each affected range, we'd have
> to issue another compaction across the entire set of sstables that could
> have the data the view needs (potentially many GB), in order to send over
> the corrected version of the partition, then send it over to the view.
> Without an index in place, we have to do yet another scan, per-affected
> range.
> > > > >>
> > > > >> Consider the case of a single corrupted SSTable on the view
> that's removed from the filesystem, or the data is simply missing after
> being restored from an inconsistent backup.  It presumably contains lots of
> partitions, which maps to base partitions all over the cluster, in a lot of
> different token ranges.  For every one of those ranges (hundreds, to tens
> of thousands of them given the checkerboard design), when finding the
> missing data in the base, you'll have to perform a compaction across all
> the SSTables that potentially contain the missing data just to rebuild the
> view-oriented partitions that need to be sent to the view.  The complexity
> of this operation can be looked at as O(N*M) where N and M are the number
> of ranges in the base table and the view affected by the corruption,
> respectively.  Without an index in place, finding the missing data is very
> expensive.  We potentially have to do it several times on each node,
> depending on the size of the range.  Smaller ranges increase the size of
> the board exponentially, larger ranges increase the number of SSTables that
> would be involved in each compaction.
> > > > >>
> > > > >> Then you send that data over to the view, the view does its
> anti-compaction thing, again, once per affected range.  So now the view has
> to do an anti-compaction once per block on the board that's affected by the
> missing data.
> > > > >>
> > > > >> Doing hundreds or thousands of these will add up pretty quickly.
> > > > >>
> > > > >> When I said that a repair could take months, this is what I had
> in mind.
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston <
> bl...@ultrablake.com> wrote:
> > > > >>>
> > > > >>> > Adds overhead in the hot path due to maintaining indexes.
> Extra memory needed during write path and compaction.
> > > > >>>
> > > > >>> I’d make the same argument about the overhead of maintaining the
> index that Jon just made about the disk space required. The relatively
> predictable overhead of maintaining the index as part of the write and
> compaction paths is a pro, not a con. Although you’re not always paying the
> cost of building a merkle tree with snapshot repair, it can impact the hot
> path and you do have to plan for it.
> > > > >>>
> > > > >>> > Verifies index content, not actual data—may miss
> low-probability errors like bit flips
> > > > >>>
> > > > >>> Presumably this could be handled by the views performing repair
> against each other? You could also periodically rebuild the index or
> perform checksums against the sstable content.
> > > > >>>
> > > > >>> > Extra data scan during inconsistency detection
> > > > >>> > Index: Since the data covered by certain indexes is not
> guaranteed to be fully contained within a single node as the topology
> changes, some data scans may be wasted.
> > > > >>> > Snapshots: No extra data scan
> > > > >>>
> > > > >>> This isn’t really the whole story. The amount of wasted scans on
> index repairs is negligible. If a difference is detected with snapshot
> repairs though, you have to read the entire partition from both the view
> and base table to calculate what needs to be fixed.
> > > > >>>
> > > > >>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
> > > > >>>> One practical aspect that isn't immediately obvious is the disk
> space consideration for snapshots.
> > > > >>>>
> > > > >>>> When you have a table with a mixed workload using LCS or UCS
> with scaling parameters like L10 and initiate a repair, the disk usage will
> increase as long as the snapshot persists and the table continues to
> receive writes. This aspect is understood and factored into the design.
> > > > >>>>
> > > > >>>> However, a more nuanced point is the necessity to maintain
> sufficient disk headroom specifically for running repairs. This echoes the
> challenge with STCS compaction, where enough space must be available to
> accommodate the largest SSTables, even when they are not being actively
> compacted.
> > > > >>>>
> > > > >>>> For example, if a repair involves rewriting 100GB of SSTable
> data, you'll consistently need to reserve 100GB of free space to facilitate
> this.
> > > > >>>>
> > > > >>>> Therefore, while the snapshot-based approach leads to variable
> disk space utilization, operators must provision storage as if the maximum
> potential space will be used at all times to ensure repairs can be executed.
> > > > >>>>
> > > > >>>> This introduces a rate of churn dynamic, where the write
> throughput dictates the required extra disk space, rather than the existing
> on-disk data volume.
> > > > >>>>
> > > > >>>> If 50% of your SSTables are rewritten during a snapshot, you
> would need 50% free disk space. Depending on the workload, the snapshot
> method could consume significantly more disk space than an index-based
> approach. Conversely, for relatively static workloads, the index method
> might require more space. It's not as straightforward as stating "No extra
> disk space needed".
> > > > >>>>
> > > > >>>> Jon
> > > > >>>>
> > > > >>>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu <curly...@gmail.com>
> wrote:
> > > > >>>>> > Regarding your comparison between approaches, I think you
> also need to take into account the other dimensions that have been brought
> up in this thread. Things like minimum repair times and vulnerability to
> outages and topology changes are the first that come to mind.
> > > > >>>>>
> > > > >>>>> Sure, I added a few more points.
> > > > >>>>>
> > > > >>>>> Comparison: Index-Based Solution vs. Snapshot-Based Solution
> > > > >>>>>
> > > > >>>>> 1. Hot path overhead
> > > > >>>>>    Index: Adds overhead in the hot path due to maintaining
> > > > >>>>>    indexes. Extra memory needed during write path and compaction.
> > > > >>>>>    Snapshot: No impact on the hot path.
> > > > >>>>>
> > > > >>>>> 2. Extra disk usage when repair is not running
> > > > >>>>>    Index: Requires additional disk space to store persistent
> > > > >>>>>    indexes.
> > > > >>>>>    Snapshot: No extra disk space needed.
> > > > >>>>>
> > > > >>>>> 3. Extra disk usage during repair
> > > > >>>>>    Index: Minimal or no additional disk usage.
> > > > >>>>>    Snapshot: Requires additional disk space for snapshots.
> > > > >>>>>
> > > > >>>>> 4. Fine-grained repair to deal with emergency situations /
> > > > >>>>>    topology changes
> > > > >>>>>    Index: Supports fine-grained repairs by targeting specific
> > > > >>>>>    index ranges. This allows repair to be retried on smaller data
> > > > >>>>>    sets, enabling incremental progress when repairing the entire
> > > > >>>>>    table. This is especially helpful when there are down nodes or
> > > > >>>>>    topology changes during repair, which are common in day-to-day
> > > > >>>>>    operations.
> > > > >>>>>    Snapshot: Coordination across all nodes is required over a
> > > > >>>>>    long period of time. For each round of repair, if all replica
> > > > >>>>>    nodes are down or if there is a topology change, the data
> > > > >>>>>    ranges that were not covered will need to be repaired in the
> > > > >>>>>    next round.
> > > > >>>>>
> > > > >>>>> 5. Validating data used in reads directly
> > > > >>>>>    Index: Verifies index content, not actual data—may miss
> > > > >>>>>    low-probability errors like bit flips.
> > > > >>>>>    Snapshot: Verifies actual data content, providing stronger
> > > > >>>>>    correctness guarantees.
> > > > >>>>>
> > > > >>>>> 6. Extra data scan during inconsistency detection
> > > > >>>>>    Index: Since the data covered by certain indexes is not
> > > > >>>>>    guaranteed to be fully contained within a single node as the
> > > > >>>>>    topology changes, some data scans may be wasted.
> > > > >>>>>    Snapshot: No extra data scan.
> > > > >>>>>
> > > > >>>>> 7. The overhead of actual data repair after an inconsistency is
> > > > >>>>>    detected
> > > > >>>>>    Index: Only indexes are streamed to the base table node, and
> > > > >>>>>    the actual data being fixed can be as accurate as the row
> > > > >>>>>    level.
> > > > >>>>>    Snapshot: Anti-compaction is needed on the MV table, and
> > > > >>>>>    potential over-streaming may occur due to the lack of
> > > > >>>>>    row-level insight into data quality.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> > one of my biggest concerns I haven't seen discussed much is
> LOCAL_SERIAL/SERIAL on read
> > > > >>>>>
> > > > >>>>> Paxos v2 introduces an optimization where serial reads can be
> completed in just one round trip, reducing latency compared to traditional
> Paxos which may require multiple phases.
> > > > >>>>>
> > > > >>>>> > I think a refresh would be low-cost and give users the
> flexibility to run them however they want.
> > > > >>>>>
> > > > >>>>> I think this is an interesting idea. Does it suggest that the
> MV should be rebuilt on a regular schedule? It sounds like an extension of
> the snapshot-based approach—rather than detecting mismatches, we would
> periodically reconstruct a clean version of the MV based on the snapshot.
> This seems to diverge from the current MV model in Cassandra, where
> consistency between the MV and base table must be maintained continuously.
> This could be an extension of the CEP-48 work, where the MV is periodically
> rebuilt from a snapshot of the base table, assuming the user can tolerate
> some level of staleness in the MV data.
> > > > >>>>>
> > > > >>>
> > > >
> > >
> >
>
>
>
>
