Re: [DISCUSS] Tooling to repair MV through a Spark job

James Berragan Fri, 06 Dec 2024 09:09:12 -0800

I think this would be useful and - having never really used Materialized
Views - I didn't know it was an issue for some users. I would say the
Cassandra Analytics library (http://github.com/apache/cassandra-analytics/)
could be utilized for much of this, with a specialized Spark job for this
purpose.


On Fri, 6 Dec 2024 at 08:26, Jaydeep Chovatia <[email protected]>
wrote:

> Hi,
>
> *NOTE: *This email does not promote using Cassandra's Materialized View
> (MV) but assists those stuck with it for various reasons.
>
> The primary issue with MV is that once it goes out of sync with the base
> table, no tooling is available to remediate it. This Spark job aims to fill
> this gap by logically comparing the MV with the base table and identifying
> inconsistencies. The job primarily does the following:
>
>    - Scans the base table (A) and the MV (B), and does an {A}-{B} analysis
>    - Categorizes each record into one of four categories: a) Consistent, b)
>    Inconsistent, c) MissingInMV, d) MissingInBaseTable
>    - Provides a detailed view of mismatches, such as the primary key, all
>    the non-primary key fields, and the mismatched columns.
>    - Dumps the detailed information to an output folder path provided to
>    the job (one can extend the interface to dump the records to some object
>    store as well)
>    - Optionally, the job fixes the MV inconsistencies.
>    - Offers rich configuration (throttling, actionable output, the ability
>    to specify a time range for the records, etc.) for running the job at
>    scale in a production environment
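
For illustration, the four-way categorization in the list above can be sketched in plain Python (not Spark). This is only a rough sketch under stated assumptions: the names `Status` and `classify` are hypothetical and are not taken from the linked repository, both inputs are assumed to be keyed by the same primary key, and only the columns present in the MV row are compared.

```python
from enum import Enum


class Status(Enum):
    CONSISTENT = "Consistent"
    INCONSISTENT = "Inconsistent"
    MISSING_IN_MV = "MissingInMV"
    MISSING_IN_BASE_TABLE = "MissingInBaseTable"


def classify(base_rows, mv_rows):
    """Compare two {primary_key: {column: value}} mappings.

    Hypothetical helper: returns {primary_key: (Status, mismatched_columns)}.
    Assumes both mappings use the same key and that the MV columns are a
    subset of the base table columns.
    """
    report = {}
    # Visit every key seen on either side, so rows missing from one side show up.
    for key in base_rows.keys() | mv_rows.keys():
        base = base_rows.get(key)
        mv = mv_rows.get(key)
        if mv is None:
            report[key] = (Status.MISSING_IN_MV, [])
        elif base is None:
            report[key] = (Status.MISSING_IN_BASE_TABLE, [])
        else:
            # Compare only the columns that the MV row carries.
            diff = sorted(c for c in mv if base.get(c) != mv[c])
            status = Status.INCONSISTENT if diff else Status.CONSISTENT
            report[key] = (status, diff)
    return report
```

In a Spark job the same idea would more likely be a full outer join of the two tables on the key columns, with the per-row comparison applied to each joined pair; the dictionary version here only shows the classification logic.
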
>
> Design doc: link
> <https://docs.google.com/document/d/14mo_3TlKmaL3mC_Vs69k1n923CoJmVFvEFvuPAAHk4I/edit?usp=sharing>
> The Git Repository: link
> <https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job>
>
> *Motivation*
>
>    1. This email's primary objective is to share with the community that
>    something like this is available for MV (in a private repository), which
>    may be helpful in emergencies to folks stuck with MV in production.
>    2. If we, as a community, want to officially foster Spark-based
>    tooling, since it can help with many things beyond the MV work (counting
>    rows, for example), then I am happy to drive the effort.
>
> Please let me know what you think.
>
> Jaydeep
>
