Re: [DISCUSS] CEP-48: First-Class Materialized View Support

Blake Eggleston Tue, 10 Jun 2025 10:02:54 -0700

>  Extra row in MV (assuming the tombstone is gone in the base table) — how 
> should we fix this?
> 
> 
>


This would mean that the base table had either updated or deleted a row and the 
view didn't receive the corresponding delete. 

In the case of a missed update, we'll have a new value and we can send a 
tombstone to the view with the timestamp of the most recent update. Since 
timestamps issued by paxos and accord writes are always increasing 
monotonically and don't have collisions, this is safe. 

In the case of a row deletion, we'd also want to send a tombstone with the same 
timestamp, however since tombstones can be purged, we may not have that 
information and would have to treat it like the view has a higher timestamp 
than the base table.

> Inconsistency (timestamps don’t match) — it’s easy to fix when the base table 
> has higher timestamps, but how do we resolve it when the MV columns have 
> higher timestamps?
> 

There are 2 ways this could happen. First is that a write failed and paxos 
repair hasn't completed it, which is expected, and the second is a replication 
bug or base table data loss. You'd need to compare the view timestamp to the 
paxos repair history to tell which it is. If the view timestamp is higher than 
the most recent paxos repair timestamp for the key, then it may just be a 
failed write and we should do nothing. If the view timestamp is less than the 
most recent paxos repair timestamp for that key and higher than the base 
timestamp, then something has gone wrong and we should issue a tombstone using 
the paxos repair timestamp as the tombstone timestamp. This is safe to do 
because the paxos repair timestamps act as a low bound for ballots paxos will 
process, so it wouldn't be possible for a legitimate write to be shadowed by 
this tombstone.

> Do we need to introduce a new kind of tombstone to shadow the rows in the MV 
> for cases 2 and 3? If yes, how will this tombstone work? If no, how should we 
> fix the MV data?
> 

No, a normal tombstone would work.

On Tue, Jun 10, 2025, at 2:42 AM, Runtian Liu wrote:
> Okay, let’s put the efficiency discussion on hold for now. I want to make 
> sure the actual repair process after detecting inconsistencies will work with 
> the index-based solution.
> 
> When a mismatch is detected, the MV replica will need to stream its index 
> file to the base table replica. The base table will then perform a comparison 
> between the two files.
> 
> There are three cases we need to handle:
> 
>  1. Missing row in MV — this is straightforward; we can propagate the data to 
> the MV.
> 
>  2. Extra row in MV (assuming the tombstone is gone in the base table) — how 
> should we fix this?
> 
>  3. Inconsistency (timestamps don’t match) — it’s easy to fix when the base 
> table has higher timestamps, but how do we resolve it when the MV columns 
> have higher timestamps?
> 
> Do we need to introduce a new kind of tombstone to shadow the rows in the MV 
> for cases 2 and 3? If yes, how will this tombstone work? If no, how should we 
> fix the MV data?
> 
> 
> On Mon, Jun 9, 2025 at 11:00 AM Blake Eggleston <[email protected]> wrote:
>> __
>> > hopefully we can come up with a solution that everyone agrees on.
>> 
>> I’m sure we can, I think we’ve been making good progress
>> 
>> > My main concern with the index-based solution is the overhead it adds to 
>> > the hot path, as well as having to build indexes periodically.
>> 
>> So the additional overhead of maintaining a storage attached index on the 
>> client write path is pretty minimal - it’s basically adding data to an in 
>> memory trie. It’s a little extra work and memory usage, but there isn’t any 
>> extra io or other blocking associated with it. I’d expect the latency impact 
>> to be negligible.
>> 
>> > As mentioned earlier, this MV repair should be an infrequent operation
>> 
>> I don’t this that’s a safe assumption. There are a lot of situations outside 
>> of data loss bugs where repair would need to be run. 
>> 
>> These use cases could probably be handled by repairing the view with other 
>> view replicas:
>> 
>> Scrubbing corrupt sstables
>> Node replacement via backup
>> 
>> These use cases would need an actual MV repair to check consistency with the 
>> base table:
>> 
>> Restoring a cluster from a backup
>> Imported sstables via nodetool import
>> Data loss from operator error
>> Proactive consistency checks - ie preview repairs
>> 
>> Even if it is an infrequent operation, when operators need it, it needs to 
>> be available and reliable.
>> 
>> It’s a fact that there are clusters where non-incremental repairs are run on 
>> a cadence of a week or more to manage the overhead of validation 
>> compactions. Assuming the cluster doesn’t have any additional headroom, that 
>> would mean that any one of the above events could cause views to remain out 
>> of sync for up to a week while the full set of merkle trees is being built.
>> 
>> This delay eliminates a lot of the value of repair as a risk mitigation 
>> tool. If I had to make a recommendation where a bad call could cost me my 
>> job, the prospect of a 7 day delay on repair would mean a strong no.
>> 
>> Some users also run preview repair continuously to detect data consistency 
>> errors, so at least a subset of users will probably be running MV repairs 
>> continuously - at least in preview mode.
>> 
>> That’s why I say that the replication path should be designed to never need 
>> repair, and MV repair should be designed to be prepared for the worst.
>> 
>> > I’m wondering if it’s possible to enable or disable index building 
>> > dynamically so that we don’t always incur the cost for something that’s 
>> > rarely needed.
>> 
>> I think this would be a really reasonable compromise as long as the default 
>> is on. That way it’s as safe as possible by default, but users who don’t 
>> care or have a separate system for repairing MVs can opt out.
>> 
>> > I’m not sure what you mean by “data problems” here.
>> 
>> I mean out of sync views - either due to bugs, operator error, corruption, 
>> etc
>> 
>> > Also, this does scale with cluster size—I’ve compared it to full repair, 
>> > and this MV repair should behave similarly. That means as long as full 
>> > repair works, this repair should work as well.
>> 
>> You could build the merkle trees at about the same cost as a full repair, 
>> but the actual data repair path is completely different for MV, and that’s 
>> the part that doesn’t scale well. As you know, with normal repair, we just 
>> stream data for ranges detected as out of sync. For Mvs, since the data 
>> isn’t in base partition order, the view data for an out of sync view range 
>> needs to be read out and streamed to every base replica that it’s detected a 
>> mismatch against. So in the example I gave with the 300 node cluster, you’re 
>> looking at reading and transmitting the same partition at least 100 times in 
>> the best case, and the cost of this keeps going up as the cluster increases 
>> in size. That's the part that doesn't scale well. 
>> 
>> This is also one the benefits of the index design. Since it stores data in 
>> segments that roughly correspond to points on the grid, you’re not rereading 
>> the same data over and over. A repair for a given grid point only reads an 
>> amount of data proportional to the data in common for the base/view grid 
>> point, and it’s stored in a small enough granularity that the base can 
>> calculate what data needs to be sent to the view without having to read the 
>> entire view partition.
>> 
>> On Sat, Jun 7, 2025, at 7:42 PM, Runtian Liu wrote:
>>> Thanks, Blake. I’m open to iterating on the design, and hopefully we can 
>>> come up with a solution that everyone agrees on.
>>> 
>>> My main concern with the index-based solution is the overhead it adds to 
>>> the hot path, as well as having to build indexes periodically. As mentioned 
>>> earlier, this MV repair should be an infrequent operation, but the 
>>> index-based approach shifts some of the work to the hot path in order to 
>>> allow repairs that touch only a few nodes.
>>> 
>>> I’m wondering if it’s possible to enable or disable index building 
>>> dynamically so that we don’t always incur the cost for something that’s 
>>> rarely needed.
>>> 
>>> > it degrades operators ability to react to data problems by imposing a 
>>> > significant upfront processing burden on repair, and that it doesn’t 
>>> > scale well with cluster size
>>> 
>>> I’m not sure what you mean by “data problems” here. Also, this does scale 
>>> with cluster size—I’ve compared it to full repair, and this MV repair 
>>> should behave similarly. That means as long as full repair works, this 
>>> repair should work as well.
>>> 
>>> For example, regardless of how large the cluster is, you can always enable 
>>> Merkle tree building on 10% of the nodes at a time until all the trees are 
>>> ready.
>>> 
>>> I understand that coordinating this type of repair is harder than what we 
>>> currently support, but with CEP-37, we should be able to handle this 
>>> coordination without adding too much burden on the operator side.
>>> 
>>> 
>>> On Sat, Jun 7, 2025 at 8:28 AM Blake Eggleston <[email protected]> wrote:
>>>> __
>>>>> I don't see any outcome here that is good for the community though. 
>>>>> Either Runtian caves and adopts your design that he (and I) consider 
>>>>> inferior, or he is prevented from contributing this work.
>>>> 
>>>> Hey Runtian, fwiw, these aren't the only 2 options. This isn’t a 
>>>> competition. We can collaborate and figure out the best approach to the 
>>>> problem. I’d like to keep discussing it if you’re open to iterating on the 
>>>> design.
>>>> 
>>>> I’m not married to our proposal, it’s just the cleanest way we could think 
>>>> of to address what Jon and I both see as blockers in the current proposal. 
>>>> It’s not set in stone though.
>>>> 
>>>> On Fri, Jun 6, 2025, at 1:32 PM, Benedict Elliott Smith wrote:
>>>>> Hmm, I am very surprised as I helped write that and I distinctly recall a 
>>>>> specific goal was avoiding binding vetoes as they're so toxic.
>>>>> 
>>>>> Ok, I guess you can block this work if you like.
>>>>> 
>>>>> I don't see any outcome here that is good for the community though. 
>>>>> Either Runtian caves and adopts your design that he (and I) consider 
>>>>> inferior, or he is prevented from contributing this work. That isn't a 
>>>>> functioning community in my mind, so I'll be noping out for a while, as I 
>>>>> don't see much value here right now.
>>>>> 
>>>>> 
>>>>> On 2025/06/06 18:31:08 Blake Eggleston wrote:
>>>>> > Hi Benedict, that’s actually not true. 
>>>>> > 
>>>>> > Here’s a link to the project governance page: 
>>>>> > _https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Project+Governance_
>>>>> > 
>>>>> > The CEP section says:
>>>>> > 
>>>>> > “*Once the proposal is finalized and any major committer dissent 
>>>>> > reconciled, call a [VOTE] on the ML to have the proposal adopted. The 
>>>>> > criteria for acceptance is consensus (3 binding +1 votes and no binding 
>>>>> > vetoes). The vote should remain open for 72 hours.*”
>>>>> > 
>>>>> > So they’re definitely vetoable. 
>>>>> > 
>>>>> > Also note the part about “*Once the proposal is finalized and any major 
>>>>> > committer dissent reconciled,*” being a prerequisite for moving a CEP 
>>>>> > to [VOTE]. Given the as yet unreconciled committer dissent, it wouldn’t 
>>>>> > even be appropriate to move to a VOTE until we get to the bottom of 
>>>>> > this repair discussion.
>>>>> > 
>>>>> > On Fri, Jun 6, 2025, at 12:31 AM, Benedict Elliott Smith wrote:
>>>>> > > > but the snapshot repair design is not a viable path forward. It’s 
>>>>> > > > the first iteration of a repair design. We’ve proposed a second 
>>>>> > > > iteration, and we’re open to a third iteration.
>>>>> > > 
>>>>> > > I shan't be participating further in discussion, but I want to make a 
>>>>> > > point of order. The CEP process has no vetoes, so you are not 
>>>>> > > empowered to declare that a design is not viable without the input of 
>>>>> > > the wider community.
>>>>> > > 
>>>>> > > 
>>>>> > > On 2025/06/05 03:58:59 Blake Eggleston wrote:
>>>>> > > > You can detect and fix the mismatch in a single round of repair, 
>>>>> > > > but the amount of work needed to do it is _significantly_ higher 
>>>>> > > > with snapshot repair. Consider a case where we have a 300 node 
>>>>> > > > cluster w/ RF 3, where each view partition contains entries mapping 
>>>>> > > > to every token range in the cluster - so 100 ranges. If we lose a 
>>>>> > > > view sstable, it will affect an entire row/column of the grid. 
>>>>> > > > Repair is going to scan all data in the mismatching view token 
>>>>> > > > ranges 100 times, and each base range once. So you’re looking at 
>>>>> > > > 200 range scans.
>>>>> > > > 
>>>>> > > > Now, you may argue that you can merge the duplicate view scans into 
>>>>> > > > a single scan while you repair all token ranges in parallel. I’m 
>>>>> > > > skeptical that’s going to be achievable in practice, but even if it 
>>>>> > > > is, we’re now talking about the view replica hypothetically doing a 
>>>>> > > > pairwise repair with every other replica in the cluster at the same 
>>>>> > > > time. Neither of these options is workable.
>>>>> > > > 
>>>>> > > > Let’s take a step back though, because I think we’re getting lost 
>>>>> > > > in the weeds.
>>>>> > > > 
>>>>> > > > The repair design in the CEP has some high level concepts that make 
>>>>> > > > a lot of sense, the idea of repairing a grid is really smart. 
>>>>> > > > However, it has some significant drawbacks that remain unaddressed. 
>>>>> > > > I want this CEP to succeed, and I know Jon does too, but the 
>>>>> > > > snapshot repair design is not a viable path forward. It’s the first 
>>>>> > > > iteration of a repair design. We’ve proposed a second iteration, 
>>>>> > > > and we’re open to a third iteration. This part of the CEP process 
>>>>> > > > is meant to identify and address shortcomings, I don’t think that 
>>>>> > > > continuing to dissect the snapshot repair design is making progress 
>>>>> > > > in that direction.
>>>>> > > > 
>>>>> > > > On Wed, Jun 4, 2025, at 2:04 PM, Runtian Liu wrote:
>>>>> > > > > >  We potentially have to do it several times on each node, 
>>>>> > > > > > depending on the size of the range. Smaller ranges increase the 
>>>>> > > > > > size of the board exponentially, larger ranges increase the 
>>>>> > > > > > number of SSTables that would be involved in each compaction.
>>>>> > > > > As described in the CEP example, this can be handled in a single 
>>>>> > > > > round of repair. We first identify all the points in the grid 
>>>>> > > > > that require repair, then perform anti-compaction and stream data 
>>>>> > > > > based on a second scan over those identified points. This applies 
>>>>> > > > > to the snapshot-based solution—without an index, repairing a 
>>>>> > > > > single point in that grid requires scanning the entire base table 
>>>>> > > > > partition (token range). In contrast, with the index-based 
>>>>> > > > > solution—as in the example you referenced—if a large block of 
>>>>> > > > > data is corrupted, even though the index is used for comparison, 
>>>>> > > > > many key mismatches may occur. This can lead to random disk 
>>>>> > > > > access to the original data files, which could cause performance 
>>>>> > > > > issues. For the case you mentioned for snapshot based solution, 
>>>>> > > > > it should not take months to repair all the data, instead one 
>>>>> > > > > round of repair should be enough. The actual repair phase is 
>>>>> > > > > split from the detection phase.
>>>>> > > > > 
>>>>> > > > > 
>>>>> > > > > On Thu, Jun 5, 2025 at 12:12 AM Jon Haddad 
>>>>> > > > > <[email protected]> wrote:
>>>>> > > > >> > This isn’t really the whole story. The amount of wasted scans 
>>>>> > > > >> > on index repairs is negligible. If a difference is detected 
>>>>> > > > >> > with snapshot repairs though, you have to read the entire 
>>>>> > > > >> > partition from both the view and base table to calculate what 
>>>>> > > > >> > needs to be fixed.
>>>>> > > > >> 
>>>>> > > > >> You nailed it.
>>>>> > > > >> 
>>>>> > > > >> When the base table is converted to a view, and sent to the 
>>>>> > > > >> view, the information we have is that one of the view's 
>>>>> > > > >> partition keys needs a repair.  That's going to be different 
>>>>> > > > >> from the partition key of the base table.  As a result, on the 
>>>>> > > > >> base table, for each affected range, we'd have to issue another 
>>>>> > > > >> compaction across the entire set of sstables that could have the 
>>>>> > > > >> data the view needs (potentially many GB), in order to send over 
>>>>> > > > >> the corrected version of the partition, then send it over to the 
>>>>> > > > >> view.  Without an index in place, we have to do yet another 
>>>>> > > > >> scan, per-affected range.  
>>>>> > > > >> 
>>>>> > > > >> Consider the case of a single corrupted SSTable on the view 
>>>>> > > > >> that's removed from the filesystem, or the data is simply 
>>>>> > > > >> missing after being restored from an inconsistent backup.  It 
>>>>> > > > >> presumably contains lots of partitions, which maps to base 
>>>>> > > > >> partitions all over the cluster, in a lot of different token 
>>>>> > > > >> ranges.  For every one of those ranges (hundreds, to tens of 
>>>>> > > > >> thousands of them given the checkerboard design), when finding 
>>>>> > > > >> the missing data in the base, you'll have to perform a 
>>>>> > > > >> compaction across all the SSTables that potentially contain the 
>>>>> > > > >> missing data just to rebuild the view-oriented partitions that 
>>>>> > > > >> need to be sent to the view.  The complexity of this operation 
>>>>> > > > >> can be looked at as O(N*M) where N and M are the number of 
>>>>> > > > >> ranges in the base table and the view affected by the 
>>>>> > > > >> corruption, respectively.  Without an index in place, finding 
>>>>> > > > >> the missing data is very expensive.  We potentially have to do 
>>>>> > > > >> it several times on each node, depending on the size of the 
>>>>> > > > >> range.  Smaller ranges increase the size of the board 
>>>>> > > > >> exponentially, larger ranges increase the number of SSTables 
>>>>> > > > >> that would be involved in each compaction.
>>>>> > > > >> 
>>>>> > > > >> Then you send that data over to the view, the view does it's 
>>>>> > > > >> anti-compaction thing, again, once per affected range.  So now 
>>>>> > > > >> the view has to do an anti-compaction once per block on the 
>>>>> > > > >> board that's affected by the missing data.  
>>>>> > > > >> 
>>>>> > > > >> Doing hundreds or thousands of these will add up pretty quickly.
>>>>> > > > >> 
>>>>> > > > >> When I said that a repair could take months, this is what I had 
>>>>> > > > >> in mind.
>>>>> > > > >> 
>>>>> > > > >> 
>>>>> > > > >> 
>>>>> > > > >> 
>>>>> > > > >> On Tue, Jun 3, 2025 at 11:10 AM Blake Eggleston 
>>>>> > > > >> <[email protected]> wrote:
>>>>> > > > >>> __
>>>>> > > > >>> > Adds overhead in the hot path due to maintaining indexes. 
>>>>> > > > >>> > Extra memory needed during write path and compaction.
>>>>> > > > >>> 
>>>>> > > > >>> I’d make the same argument about the overhead of maintaining 
>>>>> > > > >>> the index that Jon just made about the disk space required. The 
>>>>> > > > >>> relatively predictable overhead of maintaining the index as 
>>>>> > > > >>> part of the write and compaction paths is a pro, not a con. 
>>>>> > > > >>> Although you’re not always paying the cost of building a merkle 
>>>>> > > > >>> tree with snapshot repair, it can impact the hot path and you 
>>>>> > > > >>> do have to plan for it.
>>>>> > > > >>> 
>>>>> > > > >>> > Verifies index content, not actual data—may miss 
>>>>> > > > >>> > low-probability errors like bit flips
>>>>> > > > >>> 
>>>>> > > > >>> Presumably this could be handled by the views performing repair 
>>>>> > > > >>> against each other? You could also periodically rebuild the 
>>>>> > > > >>> index or perform checksums against the sstable content.
>>>>> > > > >>> 
>>>>> > > > >>> > Extra data scan during inconsistency detection
>>>>> > > > >>> > Index: Since the data covered by certain indexes is not 
>>>>> > > > >>> > guaranteed to be fully contained within a single node as the 
>>>>> > > > >>> > topology changes, some data scans may be wasted.
>>>>> > > > >>> > Snapshots: No extra data scan
>>>>> > > > >>> 
>>>>> > > > >>> This isn’t really the whole story. The amount of wasted scans 
>>>>> > > > >>> on index repairs is negligible. If a difference is detected 
>>>>> > > > >>> with snapshot repairs though, you have to read the entire 
>>>>> > > > >>> partition from both the view and base table to calculate what 
>>>>> > > > >>> needs to be fixed.
>>>>> > > > >>> 
>>>>> > > > >>> On Tue, Jun 3, 2025, at 10:27 AM, Jon Haddad wrote:
>>>>> > > > >>>> One practical aspect that isn't immediately obvious is the 
>>>>> > > > >>>> disk space consideration for snapshots.
>>>>> > > > >>>> 
>>>>> > > > >>>> When you have a table with a mixed workload using LCS or UCS 
>>>>> > > > >>>> with scaling parameters like L10 and initiate a repair, the 
>>>>> > > > >>>> disk usage will increase as long as the snapshot persists and 
>>>>> > > > >>>> the table continues to receive writes. This aspect is 
>>>>> > > > >>>> understood and factored into the design.
>>>>> > > > >>>> 
>>>>> > > > >>>> However, a more nuanced point is the necessity to maintain 
>>>>> > > > >>>> sufficient disk headroom specifically for running repairs. 
>>>>> > > > >>>> This echoes the challenge with STCS compaction, where enough 
>>>>> > > > >>>> space must be available to accommodate the largest SSTables, 
>>>>> > > > >>>> even when they are not being actively compacted.
>>>>> > > > >>>> 
>>>>> > > > >>>> For example, if a repair involves rewriting 100GB of SSTable 
>>>>> > > > >>>> data, you'll consistently need to reserve 100GB of free space 
>>>>> > > > >>>> to facilitate this.
>>>>> > > > >>>> 
>>>>> > > > >>>> Therefore, while the snapshot-based approach leads to variable 
>>>>> > > > >>>> disk space utilization, operators must provision storage as if 
>>>>> > > > >>>> the maximum potential space will be used at all times to 
>>>>> > > > >>>> ensure repairs can be executed.
>>>>> > > > >>>> 
>>>>> > > > >>>> This introduces a rate of churn dynamic, where the write 
>>>>> > > > >>>> throughput dictates the required extra disk space, rather than 
>>>>> > > > >>>> the existing on-disk data volume.
>>>>> > > > >>>> 
>>>>> > > > >>>> If 50% of your SSTables are rewritten during a snapshot, you 
>>>>> > > > >>>> would need 50% free disk space. Depending on the workload, the 
>>>>> > > > >>>> snapshot method could consume significantly more disk space 
>>>>> > > > >>>> than an index-based approach. Conversely, for relatively 
>>>>> > > > >>>> static workloads, the index method might require more space. 
>>>>> > > > >>>> It's not as straightforward as stating "No extra disk space 
>>>>> > > > >>>> needed".
>>>>> > > > >>>> 
>>>>> > > > >>>> Jon
>>>>> > > > >>>> 
>>>>> > > > >>>> On Mon, Jun 2, 2025 at 2:49 PM Runtian Liu 
>>>>> > > > >>>> <[email protected]> wrote:
>>>>> > > > >>>>> > Regarding your comparison between approaches, I think you 
>>>>> > > > >>>>> > also need to take into account the other dimensions that 
>>>>> > > > >>>>> > have been brought up in this thread. Things like minimum 
>>>>> > > > >>>>> > repair times and vulnerability to outages and topology 
>>>>> > > > >>>>> > changes are the first that come to mind.
>>>>> > > > >>>>> 
>>>>> > > > >>>>> Sure, I added a few more points.
>>>>> > > > >>>>> 
>>>>> > > > >>>>> *Perspective*
>>>>> > > > >>>>> 
>>>>> > > > >>>>> *Index-Based Solution*
>>>>> > > > >>>>> 
>>>>> > > > >>>>> *Snapshot-Based Solution*
>>>>> > > > >>>>> 
>>>>> > > > >>>>> 1. Hot path overhead
>>>>> > > > >>>>> 
>>>>> > > > >>>>> Adds overhead in the hot path due to maintaining indexes. 
>>>>> > > > >>>>> Extra memory needed during write path and compaction.
>>>>> > > > >>>>> 
>>>>> > > > >>>>> No impact on the hot path
>>>>> > > > >>>>> 
>>>>> > > > >>>>> 2. Extra disk usage when repair is not running
>>>>> > > > >>>>> 
>>>>> > > > >>>>> Requires additional disk space to store persistent indexes
>>>>> > > > >>>>> 
>>>>> > > > >>>>> No extra disk space needed
>>>>> > > > >>>>> 
>>>>> > > > >>>>> 3. Extra disk usage during repair
>>>>> > > > >>>>> 
>>>>> > > > >>>>> Minimal or no additional disk usage
>>>>> > > > >>>>> 
>>>>> > > > >>>>> Requires additional disk space for snapshots
>>>>> > > > >>>>> 
>>>>> > > > >>>>> 4. Fine-grained repair  to deal with emergency situations / 
>>>>> > > > >>>>> topology changes
>>>>> > > > >>>>> 
>>>>> > > > >>>>> Supports fine-grained repairs by targeting specific index 
>>>>> > > > >>>>> ranges. This allows repair to be retried on smaller data 
>>>>> > > > >>>>> sets, enabling incremental progress when repairing the entire 
>>>>> > > > >>>>> table. This is especially helpful when there are down nodes 
>>>>> > > > >>>>> or topology changes during repair, which are common in 
>>>>> > > > >>>>> day-to-day operations.
>>>>> > > > >>>>> 
>>>>> > > > >>>>> Coordination across all nodes is required over a long period 
>>>>> > > > >>>>> of time. For each round of repair, if all replica nodes are 
>>>>> > > > >>>>> down or if there is a topology change, the data ranges that 
>>>>> > > > >>>>> were not covered will need to be repaired in the next round.
>>>>> > > > >>>>> 
>>>>> > > > >>>>> 
>>>>> > > > >>>>> 5. Validating data used in reads directly
>>>>> > > > >>>>> 
>>>>> > > > >>>>> Verifies index content, not actual data—may miss 
>>>>> > > > >>>>> low-probability errors like bit flips
>>>>> > > > >>>>> 
>>>>> > > > >>>>> Verifies actual data content, providing stronger correctness 
>>>>> > > > >>>>> guarantees
>>>>> > > > >>>>> 
>>>>> > > > >>>>> 6. Extra data scan during inconsistency detection
>>>>> > > > >>>>> 
>>>>> > > > >>>>> Since the data covered by certain indexes is not guaranteed 
>>>>> > > > >>>>> to be fully contained within a single node as the topology 
>>>>> > > > >>>>> changes, some data scans may be wasted.
>>>>> > > > >>>>> 
>>>>> > > > >>>>> No extra data scan
>>>>> > > > >>>>> 
>>>>> > > > >>>>> 7. The overhead of actual data repair after an inconsistency 
>>>>> > > > >>>>> is detected
>>>>> > > > >>>>> 
>>>>> > > > >>>>> Only indexes are streamed to the base table node, and the 
>>>>> > > > >>>>> actual data being fixed can be as accurate as the row level.
>>>>> > > > >>>>> 
>>>>> > > > >>>>> Anti-compaction is needed on the MV table, and potential 
>>>>> > > > >>>>> over-streaming may occur due to the lack of row-level insight 
>>>>> > > > >>>>> into data quality.
>>>>> > > > >>>>> 
>>>>> > > > >>>>> 
>>>>> > > > >>>>> > one of my biggest concerns I haven't seen discussed much is 
>>>>> > > > >>>>> > LOCAL_SERIAL/SERIAL on read
>>>>> > > > >>>>> 
>>>>> > > > >>>>> Paxos v2 introduces an optimization where serial reads can be 
>>>>> > > > >>>>> completed in just one round trip, reducing latency compared 
>>>>> > > > >>>>> to traditional Paxos which may require multiple phases.
>>>>> > > > >>>>> 
>>>>> > > > >>>>> > I think a refresh would be low-cost and give users the 
>>>>> > > > >>>>> > flexibility to run them however they want.
>>>>> > > > >>>>> 
>>>>> > > > >>>>> I think this is an interesting idea. Does it suggest that the 
>>>>> > > > >>>>> MV should be rebuilt on a regular schedule? It sounds like an 
>>>>> > > > >>>>> extension of the snapshot-based approach—rather than 
>>>>> > > > >>>>> detecting mismatches, we would periodically reconstruct a 
>>>>> > > > >>>>> clean version of the MV based on the snapshot. This seems to 
>>>>> > > > >>>>> diverge from the current MV model in Cassandra, where 
>>>>> > > > >>>>> consistency between the MV and base table must be maintained 
>>>>> > > > >>>>> continuously. This could be an extension of the CEP-48 work, 
>>>>> > > > >>>>> where the MV is periodically rebuilt from a snapshot of the 
>>>>> > > > >>>>> base table, assuming the user can tolerate some level of 
>>>>> > > > >>>>> staleness in the MV data.
>>>>> > > > >>>>> 
>>>>> > > > >>> 
>>>>> > > > 
>>>>> > > 
>>>>> > 
>>>>> 
>>>> 
>>

Re: [DISCUSS] CEP-48: First-Class Materialized View Support

Reply via email to