It feels uncomfortable asking users to rely on a third party as 
heavy-weight as Spark to use a built-in feature.

Can we really not do this internally? I get that the obvious approach with 
Merkle trees is hard because of the range fanout of the MV using a different 
partitioner, but have we tried to think up a way to do this (somewhat 
efficiently) within the db? 


> On Dec 6, 2024, at 9:08 AM, James Berragan <jberra...@gmail.com> wrote:
> 
> I think this would be useful and - having never really used Materialized 
> Views - I didn't know it was an issue for some users. I would say the 
> Cassandra Analytics library (http://github.com/apache/cassandra-analytics/) 
> could be utilized for much of this, with a specialized Spark job for this 
> purpose.
> 
> On Fri, 6 Dec 2024 at 08:26, Jaydeep Chovatia <chovatia.jayd...@gmail.com 
> <mailto:chovatia.jayd...@gmail.com>> wrote:
>> Hi,
>> 
>> NOTE: This email does not promote using Cassandra's Materialized View (MV) 
>> but assists those stuck with it for various reasons.
>> 
>> The primary issue with MV is that once it goes out of sync with the base 
>> table, no tooling is available to remediate it. This Spark job aims to fill 
>> this gap by logically comparing the MV with the base table and identifying 
>> inconsistencies. The job primarily does the following:
>> 1. Scans the base table (A) and the MV (B), and performs an {A}-{B} analysis
>> 2. Categorizes each record into one of four areas: a) Consistent, b) 
>> Inconsistent, c) MissingInMV, d) MissingInBaseTable
>> 3. Provides a detailed view of mismatches, such as the primary key, all the 
>> non-primary-key fields, and the mismatched columns
>> 4. Dumps the detailed information to an output folder path provided to the job 
>> (one can extend the interface to dump the records to an object store as 
>> well)
>> 5. Optionally, fixes the MV inconsistencies
>> 6. Rich configuration (throttling, actionable output, the ability to specify a 
>> time range for the records, etc.) to run the job at scale in a production 
>> environment
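
As a rough illustration of the {A}-{B} categorization described in the list above, here is a minimal plain-Python sketch. The record structures (dicts keyed by primary key) are hypothetical and chosen for clarity; the actual job operates on Spark DataFrames built from Cassandra data.

```python
def categorize(base_rows, mv_rows):
    """Classify records by comparing base-table rows with MV rows.

    base_rows / mv_rows: dicts mapping primary key -> dict of non-key
    columns (hypothetical structures for illustration only).
    Returns a dict mapping each key to one of the four categories;
    Inconsistent entries also carry the list of mismatched columns.
    """
    result = {}
    for key, base_cols in base_rows.items():
        mv_cols = mv_rows.get(key)
        if mv_cols is None:
            # Present in the base table but absent from the MV.
            result[key] = "MissingInMV"
        elif mv_cols == base_cols:
            result[key] = "Consistent"
        else:
            # Same key exists in both, but some column values differ.
            mismatched = sorted(c for c in base_cols
                                if base_cols[c] != mv_cols.get(c))
            result[key] = ("Inconsistent", mismatched)
    for key in mv_rows:
        if key not in base_rows:
            # Present in the MV but absent from the base table.
            result[key] = "MissingInBaseTable"
    return result
```
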
>> Design doc: link 
>> <https://docs.google.com/document/d/14mo_3TlKmaL3mC_Vs69k1n923CoJmVFvEFvuPAAHk4I/edit?usp=sharing>
>> The Git Repository: link 
>> <https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job>
>> 
>> Motivation
>> This email's primary objective is to share with the community that something 
>> like this is available for MV (in a private repository), which may be 
>> helpful in emergencies to folks stuck with MV in production.
>> If we, as a community, want to officially foster tooling using Spark because 
>> it can be helpful to do many things beyond the MV work, such as counting 
>> rows, etc., then I am happy to drive the efforts.
>> Please let me know what you think.
>> 
>> Jaydeep