There are two approaches I have been thinking about for MV.

*1. Short Term (Status Quo)*

Here, we do not change the Cassandra MV architecture to drastically reduce data inconsistencies; thus, we continue to mark MV as an experimental feature.
In this case, we have two suboptions to make the data eventually consistent:
1. Use external tools (such as Spark)
2. Do it internally within Cassandra

*2. Long Term (Fix MV Architecturally)*

@curly...@gmail.com <curly...@gmail.com> and I have been discussing a few strategies to solve the fundamental issues with the current MV architecture, such as reducing the possibility of inconsistency by an order of magnitude. We are considering solving this within the DB itself.

tl;dr For #1, using external frameworks like Spark makes this much easier, because we need to hold all the data in memory for the base table and the MV to do the {A}-{B} record comparison. It is not infeasible within Cassandra, but it is challenging, and we might end up building a mini-Spark-style application inside Cassandra. So the current thinking for #1 is to rely on external frameworks to make quick progress there, so something is available in emergencies, and divert the energy towards #2.

Jaydeep

On Fri, Dec 6, 2024 at 10:28 AM Jeff Jirsa <jji...@gmail.com> wrote:

> It feels uncomfortable asking users to rely on a third party that’s as
> heavy-weight as spark to use a built-in feature.
>
> Can we really not do this internally? I get that the obvious way with
> merkle trees is hard because the range fanout of the MV using a different
> partitioner, but have we tried to think up a way to do this (somewhat
> efficiently) within the db?
>
> On Dec 6, 2024, at 9:08 AM, James Berragan <jberra...@gmail.com> wrote:
>
> I think this would be useful and - having never really used Materialized
> Views - I didn't know it was an issue for some users. I would say the
> Cassandra Analytics library (http://github.com/apache/cassandra-analytics/)
> could be utilized for much of this, with a specialized Spark job for this
> purpose.
> On Fri, 6 Dec 2024 at 08:26, Jaydeep Chovatia <chovatia.jayd...@gmail.com>
> wrote:
>
>> Hi,
>>
>> *NOTE:* This email does not promote using Cassandra's Materialized View
>> (MV) but assists those stuck with it for various reasons.
>>
>> The primary issue with MV is that once it goes out of sync with the base
>> table, no tooling is available to remediate it. This Spark job aims to fill
>> this gap by logically comparing the MV with the base table and identifying
>> inconsistencies. The job primarily does the following:
>>
>> - Scans the Base Table (A) and the MV (B), and does an {A}-{B} analysis
>> - Categorizes each record into one of four areas: a) Consistent,
>>   b) Inconsistent, c) MissingInMV, d) MissingInBaseTable
>> - Provides a detailed view of mismatches, such as the primary key, all
>>   the non-primary-key fields, and the mismatched columns
>> - Dumps the detailed information to an output folder path provided to
>>   the job (one can extend the interface to dump the records to an object
>>   store as well)
>> - Optionally fixes the MV inconsistencies
>> - Rich configuration (throttling, actionable output, capability to
>>   specify a time range for the records, etc.) to run the job at scale
>>   in a production environment
>>
>> Design doc: link
>> <https://docs.google.com/document/d/14mo_3TlKmaL3mC_Vs69k1n923CoJmVFvEFvuPAAHk4I/edit?usp=sharing>
>> The Git repository: link
>> <https://github.com/jaydeepkumar1984/cassandra-mv-repair-spark-job>
>>
>> *Motivation*
>>
>> 1. This email's primary objective is to share with the community that
>>    something like this is available for MV (in a private repository),
>>    which may be helpful in emergencies to folks stuck with MV in
>>    production.
>> 2. If we, as a community, want to officially foster tooling using
>>    Spark, because it can be helpful to do many things beyond the MV work,
>>    such as counting rows, etc., then I am happy to drive the efforts.
>>
>> Please let me know what you think.
>>
>> Jaydeep
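For readers skimming the thread, the {A}-{B} categorization described in the quoted email can be sketched in a few lines of Python. This is a simplified, in-memory illustration only; the actual job runs on Spark, and the function and variable names here are hypothetical:

```python
# Hypothetical sketch of the {A}-{B} categorization: rows are dicts
# keyed by primary key (the real job streams this comparison in Spark).

def categorize(base_rows, mv_rows):
    """Compare base-table rows (A) with MV rows (B), keyed by primary key.

    Returns a dict mapping each primary key to one of:
    Consistent, Inconsistent, MissingInMV, MissingInBaseTable.
    """
    result = {}
    for pk, row in base_rows.items():
        if pk not in mv_rows:
            result[pk] = "MissingInMV"       # present in A, absent in B
        elif mv_rows[pk] == row:
            result[pk] = "Consistent"        # identical in A and B
        else:
            result[pk] = "Inconsistent"      # same key, different columns
    for pk in mv_rows:
        if pk not in base_rows:
            result[pk] = "MissingInBaseTable"  # present in B, absent in A
    return result

base = {1: {"name": "a"}, 2: {"name": "b"}, 3: {"name": "c"}}
mv   = {1: {"name": "a"}, 2: {"name": "B"}, 4: {"name": "d"}}
print(categorize(base, mv))
# → {1: 'Consistent', 2: 'Inconsistent', 3: 'MissingInMV', 4: 'MissingInBaseTable'}
```

At Spark scale, the same logic would typically be expressed as a full outer join of the two datasets on the primary key rather than materializing both sides as dicts.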