Thanks for the information, Yifan and James!

Given that, we can scope this email discussion to just this specific MV repair. 
Two points:
1. Does this MV repair job provide additional value?
2. If yes, does it even make sense to merge this MV repair tooling, which uses Spark as its underlying technology, into Cassandra Analytics?

Jaydeep

On Dec 6, 2024, at 3:58 PM, Yifan Cai <yc25c...@gmail.com> wrote:

Oh, I just noticed that James already mentioned it. 

On Fri, Dec 6, 2024 at 3:51 PM Yifan Cai <yc25c...@gmail.com> wrote:
I would like to highlight existing tooling for "many things beyond the MV work, such as counting rows, etc." 

The Apache Cassandra Analytics project (http://github.com/apache/cassandra-analytics/) could be a great resource for this type of task. It reads directly from the SSTables in the Spark executors, which avoids sending CQL queries that could stress the cluster or interfere with production traffic. 
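As a rough illustration of the kind of Spark-based task this enables, here is a minimal PySpark sketch; the DataSource class and option names below are my assumptions and should be verified against the project's documentation, not a confirmed API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("analytics-bulk-read").getOrCreate()

    # Hypothetical usage of the Cassandra Analytics bulk reader; the
    # DataSource class and option names are assumptions to check against
    # the project's README.
    df = (spark.read
          .format("org.apache.cassandra.spark.sparksql.CassandraDataSource")
          .option("keyspace", "my_ks")     # placeholder keyspace
          .option("table", "my_table")     # placeholder table
          .load())

    # Example "beyond MV" task: counting rows without issuing CQL reads.
    print(df.count())

Because the read path goes through SSTables rather than CQL, a full-scan task like this should not compete with production queries.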

- Yifan

On Fri, Dec 6, 2024 at 8:27 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
Hi,

NOTE: This email does not promote using Cassandra's Materialized View (MV); rather, it aims to assist those stuck with it for various reasons.

The primary issue with MV is that once it goes out of sync with the base table, no tooling is available to remediate it. This Spark job aims to fill that gap by logically comparing the MV with the base table and identifying inconsistencies. The job primarily does the following (see the sketch after this list for the core comparison):
  • Scans the base table (A) and the MV (B), and performs an {A}-{B} analysis
  • Categorizes each record into one of four buckets: a) Consistent, b) Inconsistent, c) MissingInMV, d) MissingInBaseTable
  • Provides a detailed view of each mismatch, such as the primary key, all the non-primary-key fields, and the mismatched columns
  • Dumps the detailed information to an output folder path provided to the job (the interface can be extended to dump the records to an object store as well)
  • Optionally fixes the MV inconsistencies
  • Offers rich configuration (throttling, actionable output, the ability to specify a time range for the records, etc.) to run the job at scale in a production environment
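To make the {A}-{B} comparison concrete, here is a minimal PySpark sketch of the full-outer-join categorization. This is not the actual job: the input sources, key column, and value columns below are placeholders.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("mv-consistency-check").getOrCreate()

    # Placeholder inputs: in the real job these would be read from
    # Cassandra, not parquet; paths and column names are illustrative only.
    base = spark.read.parquet("/tmp/base_table")
    mv = spark.read.parquet("/tmp/mv")

    key_cols = ["pk"]            # assumed primary-key column(s)
    value_cols = ["c1", "c2"]    # assumed non-key columns to compare

    # Prefix the non-key columns and add presence markers so a full outer
    # join can tell the four cases apart.
    a = base.select(*key_cols,
                    *[F.col(c).alias(f"a_{c}") for c in value_cols],
                    F.lit(True).alias("in_base"))
    b = mv.select(*key_cols,
                  *[F.col(c).alias(f"b_{c}") for c in value_cols],
                  F.lit(True).alias("in_mv"))
    joined = a.join(b, on=key_cols, how="full_outer")

    # Null-safe inequality across all compared columns.
    mismatch = F.lit(False)
    for c in value_cols:
        mismatch = mismatch | ~F.col(f"a_{c}").eqNullSafe(F.col(f"b_{c}"))

    categorized = joined.withColumn(
        "category",
        F.when(F.col("in_mv").isNull(), "MissingInMV")
         .when(F.col("in_base").isNull(), "MissingInBaseTable")
         .when(mismatch, "Inconsistent")
         .otherwise("Consistent"))

    # Dump the detailed per-record view to an output folder.
    categorized.write.mode("overwrite").json("/tmp/mv_repair_report")

The presence markers distinguish the two "missing" cases, and eqNullSafe keeps null columns from being flagged as mismatches.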
Design doc: link
Git repository: link

Motivation
  1. This email's primary objective is to share with the community that something like this is available for MV (in a private repository), which may help folks stuck with MV in production during emergencies.
  2. If we, as a community, want to officially foster Spark-based tooling, because it can be helpful for many things beyond the MV work, such as counting rows, then I am happy to drive the effort.
Please let me know what you think.

Jaydeep
