[ https://issues.apache.org/jira/browse/CASSANDRA-15399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487942#comment-17487942 ]
Zhao Yang commented on CASSANDRA-15399:
---------------------------------------
There are already `repairHistory` and `parentRepairHistory` distributed tables;
have you considered enhancing those existing tables instead?
> Add ability to track state in repair
> ------------------------------------
>
> Key: CASSANDRA-15399
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15399
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Repair
> Reporter: David Capwell
> Assignee: David Capwell
> Priority: Normal
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> To improve visibility into repair, we should expose its internal state via
> virtual tables; the state should include coordinator as well as participant
> state (validation, sync, etc.).
> I propose the following tables:
> repairs - high-level summary of the global state of a repair; this table
> should be queried on the coordinator.
> {code:sql}
> CREATE TABLE repairs (
> id uuid,
> keyspace_name text,
> table_names frozen<list<text>>,
> ranges frozen<list<text>>,
> coordinator text,
> participants frozen<list<text>>,
> state text,
> progress_percentage float,
> last_updated_at_millis bigint,
> duration_micro bigint,
> failure_cause text,
> PRIMARY KEY ( (id) )
> )
> {code}
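> As a rough illustration of how an operator might read this table, the
> query below lists every repair the coordinator knows about. The
> `system_views` keyspace is an assumption: it is where Cassandra's existing
> virtual tables live, but this proposal does not name a keyspace.
> {code:sql}
> -- Hypothetical usage; assumes the table is registered under system_views.
> -- Run against the coordinator node, since the state is local and in-memory.
> SELECT id, keyspace_name, table_names, state,
>        progress_percentage, failure_cause
> FROM system_views.repairs;
> {code}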
> repair_tasks - represents the state of each RepairJob and its participants.
> This will show whether validations are running on participants and the
> progress they are making; this table should be queried on the coordinator.
> {code:sql}
> CREATE TABLE repair_tasks (
> id uuid,
> session_id uuid,
> keyspace_name text,
> table_name text,
> ranges frozen<list<text>>,
> coordinator text,
> participant text,
> state text,
> state_description text,
> progress_percentage float, -- between 0.0 and 100.0
> last_updated_at_millis bigint,
> duration_micro bigint,
> failure_cause text,
> PRIMARY KEY ( (id), session_id, table_name, participant )
> )
> {code}
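> Because `id` is the partition key, the per-participant progress of a single
> repair could be fetched with one partition read, as sketched below. The
> keyspace name and the UUID value are placeholders chosen for illustration,
> not part of the proposal.
> {code:sql}
> -- Hypothetical usage; system_views and the all-zero UUID are placeholders.
> -- Rows come back ordered by session_id, table_name, participant.
> SELECT session_id, table_name, participant, state,
>        state_description, progress_percentage
> FROM system_views.repair_tasks
> WHERE id = 00000000-0000-0000-0000-000000000000;
> {code}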
> repair_validations - shows the state of the validation task and is updated
> periodically while validation is running; this table should be queried on
> the participants.
> {code:sql}
> CREATE TABLE repair_validations (
> id uuid,
> session_id uuid,
> ranges frozen<list<text>>,
> keyspace_name text,
> table_name text,
> initiator text,
> state text,
> progress_percentage float,
> queue_duration_ms bigint,
> runtime_duration_ms bigint,
> total_duration_ms bigint,
> estimated_partitions bigint,
> partitions_processed bigint,
> estimated_total_bytes bigint,
> failure_cause text,
> PRIMARY KEY ( (id), session_id, table_name )
> )
> {code}
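> A sketch of how validation progress might be checked on a participant,
> narrowing to one session through the first clustering column. As above, the
> keyspace name and both UUID values are assumed placeholders.
> {code:sql}
> -- Hypothetical usage; run on the participant node, not the coordinator.
> -- Comparing partitions_processed with estimated_partitions shows how far
> -- the validation has progressed for each table.
> SELECT table_name, state, progress_percentage,
>        partitions_processed, estimated_partitions, failure_cause
> FROM system_views.repair_validations
> WHERE id = 00000000-0000-0000-0000-000000000000
>   AND session_id = 11111111-1111-1111-1111-111111111111;
> {code}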
> The main reason for exposing this state through virtual tables rather than
> durable tables is to make sure that what is exposed is accurate. In cases of
> write failures or node failures, durable tables could become inaccurate,
> creating edge cases where a repair is not running but the tables say it is;
> by relying on repair's internal in-memory bookkeeping, these problems
> go away.
> This jira does not try to solve the following:
> 1) repair resiliency - there are edge cases where repair hits an error and
> runs forever (at least from nodetool's perspective).
> 2) repair stream tracking - I have not learned the streaming side yet, and
> from what I can see multiple implementations exist, so this seems like a
> large scope. My hope is to defer it from this jira and tackle it separately.