[
https://issues.apache.org/jira/browse/CASSANDRA-15399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482815#comment-17482815
]
David Capwell commented on CASSANDRA-15399:
-------------------------------------------
{code}
cqlsh:ks> select * from system_views.repair_validations where
repair_id=d373a480-7f08-11ec-b73e-87a2fc152add ALLOW FILTERING;
@ Row 1
-------------------------------+------------------------------------------------
id | 15183e2e-bdf1-3ba0-b039-72057b2d0317
duration_millis | 36
estimated_partitions | 128
estimated_total_bytes | 27
failure_cause | null
initiator | /127.0.0.1:7000
keyspace_name | ks
last_updated_at | 2022-01-27 00:34:02.708000+0000
partitions_processed | 1
progress_percentage | 100
ranges | ['(-3074457345618258603,3074457345618258602]']
repair_id | d373a480-7f08-11ec-b73e-87a2fc152add
session_id | d3822370-7f08-11ec-b73e-87a2fc152add
state | success
state_failure_timestamp | null
state_init_timestamp | 2022-01-27 00:34:02.672000+0000
state_sending_trees_timestamp | 2022-01-27 00:34:02.705000+0000
state_skipped_timestamp | null
state_started_timestamp | 2022-01-27 00:34:02.678000+0000
state_success_timestamp | 2022-01-27 00:34:02.708000+0000
success_message | null
table_name | users
@ Row 2
-------------------------------+------------------------------------------------
id | 730585a5-e105-32f3-a8e4-be6771827819
duration_millis | 37
estimated_partitions | 1
estimated_total_bytes | 26
failure_cause | null
initiator | /127.0.0.1:7000
keyspace_name | ks
last_updated_at | 2022-01-27 00:34:02.706000+0000
partitions_processed | 1
progress_percentage | 100
ranges | ['(3074457345618258602,-9223372036854775808]']
repair_id | d373a480-7f08-11ec-b73e-87a2fc152add
session_id | d38027a0-7f08-11ec-b73e-87a2fc152add
state | success
state_failure_timestamp | null
state_init_timestamp | 2022-01-27 00:34:02.669000+0000
state_sending_trees_timestamp | 2022-01-27 00:34:02.703000+0000
state_skipped_timestamp | null
state_started_timestamp | 2022-01-27 00:34:02.676000+0000
state_success_timestamp | 2022-01-27 00:34:02.706000+0000
success_message | null
table_name | users
(2 rows)
{code}
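As a reading aid for the output above, the per-state timestamps can be turned into time spent in each state. This is only a sketch: the timestamp strings are copied from Row 1, while the helper name `phase_durations_ms` and the idea of subtracting consecutive timestamps are assumptions for illustration, not something the virtual table itself provides.

```python
from datetime import datetime, timedelta

# Non-null state timestamps copied from Row 1 of the cqlsh output.
row = {
    "state_init_timestamp": "2022-01-27 00:34:02.672000+0000",
    "state_started_timestamp": "2022-01-27 00:34:02.678000+0000",
    "state_sending_trees_timestamp": "2022-01-27 00:34:02.705000+0000",
    "state_success_timestamp": "2022-01-27 00:34:02.708000+0000",
    "state_failure_timestamp": None,
    "state_skipped_timestamp": None,
}

FMT = "%Y-%m-%d %H:%M:%S.%f%z"


def phase_durations_ms(row):
    """Hypothetical helper: return (state, ms until the next state) pairs.

    Null timestamps (states that were never reached) are ignored.
    """
    stamps = sorted(
        (datetime.strptime(value, FMT), key[len("state_"):-len("_timestamp")])
        for key, value in row.items()
        if value is not None
    )
    one_ms = timedelta(milliseconds=1)
    return [
        (name, (next_time - time) // one_ms)
        for (time, name), (next_time, _) in zip(stamps, stamps[1:])
    ]


durations = phase_durations_ms(row)
for name, ms in durations:
    print(f"{name}: {ms} ms")
print("total:", sum(ms for _, ms in durations), "ms")
```

For this row the total comes to 36 ms, which matches the reported `duration_millis`; the comment does not say how that column is computed, so treat the match as an observation rather than a definition.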
> Add ability to track state in repair
> ------------------------------------
>
> Key: CASSANDRA-15399
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15399
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Repair
> Reporter: David Capwell
> Assignee: David Capwell
> Priority: Normal
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> To improve visibility into repair, we should expose its internal state via
> virtual tables; this should cover coordinator state as well as participant
> state (validation, sync, etc.).
> I propose the following tables:
> repairs - high-level summary of the global state of repair; this should be
> queried on the coordinator.
> {code:sql}
> CREATE TABLE repairs (
> id uuid,
> keyspace_name text,
> table_names frozen<list<text>>,
> ranges frozen<list<text>>,
> coordinator text,
> participants frozen<list<text>>,
> state text,
> progress_percentage float,
> last_updated_at_millis bigint,
> duration_micro bigint,
> failure_cause text,
> PRIMARY KEY ( (id) )
> )
> {code}
> repair_tasks - represents RepairJob and participant state. This shows
> whether validations are running on participants and the progress they are
> making; this should be queried on the coordinator.
> {code:sql}
> CREATE TABLE repair_tasks (
> id uuid,
> session_id uuid,
> keyspace_name text,
> table_name text,
> ranges frozen<list<text>>,
> coordinator text,
> participant text,
> state text,
> state_description text,
> progress_percentage float, -- between 0.0 and 100.0
> last_updated_at_millis bigint,
> duration_micro bigint,
> failure_cause text,
> PRIMARY KEY ( (id), session_id, table_name, participant )
> )
> {code}
> repair_validations - shows the state of the validation task and is updated
> periodically while validation is running; this should be queried on the
> participants.
> {code:sql}
> CREATE TABLE repair_validations (
> id uuid,
> session_id uuid,
> ranges frozen<list<text>>,
> keyspace_name text,
> table_name text,
> initiator text,
> state text,
> progress_percentage float,
> queue_duration_ms bigint,
> runtime_duration_ms bigint,
> total_duration_ms bigint,
> estimated_partitions bigint,
> partitions_processed bigint,
> estimated_total_bytes bigint,
> failure_cause text,
> PRIMARY KEY ( (id), session_id, table_name )
> )
> {code}
> The main reason for exposing virtual tables rather than durable tables is to
> make sure that what is exposed is accurate. With write failures or node
> failures, durable tables could become inaccurate and add edge cases where
> the repair is not running but the tables say it is; by relying on repair's
> internal in-memory bookkeeping, these problems go away.
> This jira does not try to solve the following:
> 1) repair resiliency - there are edge cases where repair hits an error and
> runs forever (at least from nodetool's perspective).
> 2) repair stream tracking - I have not learned the streaming side yet, and
> what I see is that multiple implementations exist, so the scope seems large.
> My hope is to punt this from this jira and tackle it separately.
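The rationale for virtual tables in the description above can be illustrated with a small sketch: the row is built from a live in-memory object at read time, so nothing stale is left behind if a write or a node fails. Everything here is hypothetical (the class `ValidationState`, its methods, and the injected `clock`); it is not Cassandra's implementation. In particular, reporting 100 percent on success is an assumption chosen so that a finished task looks like the rows shown in the comment.

```python
import time


class ValidationState:
    """Hypothetical in-memory bookkeeping for one validation task."""

    def __init__(self, estimated_partitions, clock=time.time):
        self._clock = clock
        self.estimated_partitions = estimated_partitions
        self.partitions_processed = 0
        self.state = "init"
        # One timestamp per state reached, recorded at transition time.
        self.timestamps = {"init": clock()}

    def processed(self, count):
        """Record that `count` more partitions were validated."""
        self.partitions_processed += count

    def transition(self, state):
        """Move to a new state and remember when it happened."""
        self.state = state
        self.timestamps[state] = self._clock()

    def snapshot(self):
        """Build a row from live state; nothing is persisted."""
        if self.state == "success":
            progress = 100.0  # assumption: a finished task reports 100
        else:
            total = max(self.estimated_partitions, 1)
            progress = min(100.0, 100.0 * self.partitions_processed / total)
        row = {
            "state": self.state,
            "estimated_partitions": self.estimated_partitions,
            "partitions_processed": self.partitions_processed,
            "progress_percentage": progress,
        }
        for name, stamp in self.timestamps.items():
            row[f"state_{name}_timestamp"] = stamp
        return row


task = ValidationState(estimated_partitions=4)
task.processed(2)
print(task.snapshot())
task.transition("success")
print(task.snapshot())
```

Because the snapshot is derived on every read, the object disappears with the process that owns it, which is the property the description relies on when it prefers in-memory bookkeeping over durable tables.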