It feels uncomfortable asking users to rely on a third-party dependency as heavyweight as Spark to use a built-in feature.
Can we really not do this internally? I get that the obvious way with Merkle trees is hard because of the range fanout of the MV using a different partitioner, but have we tried to think up a way to do this (somewhat efficiently) within the db?

> On Dec 6, 2024, at 9:08 AM, James Berragan <jberra...@gmail.com> wrote:
>
> I think this would be useful and - having never really used Materialized
> Views - I didn't know it was an issue for some users. I would say the
> Cassandra Analytics library (http://github.com/apache/cassandra-analytics/)
> could be utilized for much of this, with a specialized Spark job for this
> purpose.
>
> On Fri, 6 Dec 2024 at 08:26, Jaydeep Chovatia <chovatia.jayd...@gmail.com
> <mailto:chovatia.jayd...@gmail.com>> wrote:
>>
>> Hi,
>>
>> NOTE: This email does not promote using Cassandra's Materialized View (MV)
>> but assists those stuck with it for various reasons.
>>
>> The primary issue with MV is that once it goes out of sync with the base
>> table, no tooling is available to remediate it. This Spark job aims to fill
>> this gap by logically comparing the MV with the base table and identifying
>> inconsistencies. The job primarily does the following:
>> - Scans the base table (A) and the MV (B), and does an {A}-{B} analysis
>> - Categorizes each record into one of four areas: a) Consistent,
>>   b) Inconsistent, c) MissingInMV, d) MissingInBaseTable
>> - Provides a detailed view of mismatches, such as the primary key, all the
>>   non-primary-key fields, and the mismatched columns
>> - Dumps the detailed information to an output folder path provided to the
>>   job (one can extend the interface to dump the records to an object store
>>   as well)
>> - Optionally, the job fixes the MV inconsistencies
>> - Rich configuration (throttling, actionable output, the capability to
>>   specify a time range for the records, etc.) to run the job at scale in a
>>   production environment
>>
>> Design doc: link
>> <https://docs.google.com/document/d/14mo_3TlKmaL3mC_Vs69k1n923CoJmVFvEFvuPAAHk4I/edit?usp=sharing>
>> The Git repository: link
>> <https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job>
>>
>> Motivation
>> This email's primary objective is to share with the community that something
>> like this is available for MV (in a private repository), which may be
>> helpful in emergencies to folks stuck with MV in production.
>> If we, as a community, want to officially foster tooling using Spark because
>> it can be helpful to do many things beyond the MV work, such as counting
>> rows, etc., then I am happy to drive the efforts.
>> Please let me know what you think.
>>
>> Jaydeep
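For readers unfamiliar with the {A}-{B} analysis the job performs, here is a minimal, Spark-free sketch of the four-way categorization in plain Python. The function name, record shapes, and keying are illustrative assumptions, not the repository's actual API; the real job streams rows from both tables via Spark rather than holding them in memory.

```python
def categorize(base, mv):
    """Compare base-table rows (A) against MV rows (B), keyed by primary key.

    Returns a dict mapping each primary key to one of:
    'Consistent', 'Inconsistent', 'MissingInMV', 'MissingInBaseTable'.
    """
    result = {}
    # Pass 1: every base-table key is checked against the MV.
    for pk, row in base.items():
        if pk not in mv:
            result[pk] = "MissingInMV"
        elif mv[pk] == row:
            result[pk] = "Consistent"
        else:
            result[pk] = "Inconsistent"
    # Pass 2: any MV key absent from the base table is an orphan.
    for pk in mv:
        if pk not in base:
            result[pk] = "MissingInBaseTable"
    return result

# Illustrative data: key 2 has drifted, key 3 never made it to the MV,
# and key 4 exists only in the MV.
base = {1: {"v": "a"}, 2: {"v": "b"}, 3: {"v": "c"}}
mv = {1: {"v": "a"}, 2: {"v": "x"}, 4: {"v": "d"}}
print(categorize(base, mv))
```

The optional repair step described in the thread would then act on the non-Consistent buckets, e.g. re-writing MissingInMV rows to the view, subject to the job's throttling configuration.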