[ https://issues.apache.org/jira/browse/CASSANDRA-15399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482815#comment-17482815 ]
David Capwell commented on CASSANDRA-15399:
-------------------------------------------

{code}
cqlsh:ks> select * from system_views.repair_validations where repair_id=d373a480-7f08-11ec-b73e-87a2fc152add ALLOW FILTERING;

@ Row 1
-------------------------------+------------------------------------------------
 id                            | 15183e2e-bdf1-3ba0-b039-72057b2d0317
 duration_millis               | 36
 estimated_partitions          | 128
 estimated_total_bytes         | 27
 failure_cause                 | null
 initiator                     | /127.0.0.1:7000
 keyspace_name                 | ks
 last_updated_at               | 2022-01-27 00:34:02.708000+0000
 partitions_processed          | 1
 progress_percentage           | 100
 ranges                        | ['(-3074457345618258603,3074457345618258602]']
 repair_id                     | d373a480-7f08-11ec-b73e-87a2fc152add
 session_id                    | d3822370-7f08-11ec-b73e-87a2fc152add
 state                         | success
 state_failure_timestamp       | null
 state_init_timestamp          | 2022-01-27 00:34:02.672000+0000
 state_sending_trees_timestamp | 2022-01-27 00:34:02.705000+0000
 state_skipped_timestamp       | null
 state_started_timestamp       | 2022-01-27 00:34:02.678000+0000
 state_success_timestamp       | 2022-01-27 00:34:02.708000+0000
 success_message               | null
 table_name                    | users

@ Row 2
-------------------------------+------------------------------------------------
 id                            | 730585a5-e105-32f3-a8e4-be6771827819
 duration_millis               | 37
 estimated_partitions          | 1
 estimated_total_bytes         | 26
 failure_cause                 | null
 initiator                     | /127.0.0.1:7000
 keyspace_name                 | ks
 last_updated_at               | 2022-01-27 00:34:02.706000+0000
 partitions_processed          | 1
 progress_percentage           | 100
 ranges                        | ['(3074457345618258602,-9223372036854775808]']
 repair_id                     | d373a480-7f08-11ec-b73e-87a2fc152add
 session_id                    | d38027a0-7f08-11ec-b73e-87a2fc152add
 state                         | success
 state_failure_timestamp       | null
 state_init_timestamp          | 2022-01-27 00:34:02.669000+0000
 state_sending_trees_timestamp | 2022-01-27 00:34:02.703000+0000
 state_skipped_timestamp       | null
 state_started_timestamp       | 2022-01-27 00:34:02.676000+0000
 state_success_timestamp       | 2022-01-27 00:34:02.706000+0000
 success_message               | null
 table_name                    | users

(2 rows)
{code}

> Add ability to track state in repair
> ------------------------------------
>
>                 Key: CASSANDRA-15399
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15399
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Repair
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> To enhance the visibility in repair, we should expose internal state via
> virtual tables; the state should include coordinator as well as participant
> state (validation, sync, etc.)
> I propose the following tables:
> repairs - high-level summary of the global state of repair; this should be
> queried on the coordinator.
> {code:sql}
> CREATE TABLE repairs (
>     id uuid,
>     keyspace_name text,
>     table_names frozen<list<text>>,
>     ranges frozen<list<text>>,
>     coordinator text,
>     participants frozen<list<text>>,
>     state text,
>     progress_percentage float,
>     last_updated_at_millis bigint,
>     duration_micro bigint,
>     failure_cause text,
>     PRIMARY KEY ( (id) )
> )
> {code}
> repair_tasks - represents RepairJob and participant state. This will show
> whether validations are running on participants and the progress they are
> making; this should be queried on the coordinator.
> {code:sql}
> CREATE TABLE repair_tasks (
>     id uuid,
>     session_id uuid,
>     keyspace_name text,
>     table_name text,
>     ranges frozen<list<text>>,
>     coordinator text,
>     participant text,
>     state text,
>     state_description text,
>     progress_percentage float, -- between 0.0 and 100.0
>     last_updated_at_millis bigint,
>     duration_micro bigint,
>     failure_cause text,
>     PRIMARY KEY ( (id), session_id, table_name, participant )
> )
> {code}
> repair_validations - shows the state of the validation task and is updated
> periodically while validation is running; this should be queried on the
> participants.
> {code:sql}
> CREATE TABLE repair_validations (
>     id uuid,
>     session_id uuid,
>     ranges frozen<list<text>>,
>     keyspace_name text,
>     table_name text,
>     initiator text,
>     state text,
>     progress_percentage float,
>     queue_duration_ms bigint,
>     runtime_duration_ms bigint,
>     total_duration_ms bigint,
>     estimated_partitions bigint,
>     partitions_processed bigint,
>     estimated_total_bytes bigint,
>     failure_cause text,
>     PRIMARY KEY ( (id), session_id, table_name )
> )
> {code}
> The main reason for exposing virtual tables rather than durable tables is
> to make sure that what is exposed is accurate. In cases of write failures
> or node failures, durable tables could become inaccurate and could add
> edge cases where the repair is not running but the tables say it is; by
> relying on repair's internal in-memory bookkeeping, these problems go away.
> This jira does not try to solve the following:
> 1) repair resiliency - there are edge cases where repair hits an error and
> runs forever (at least from nodetool's perspective).
> 2) repair stream tracking - I have not learned the streaming side yet, and
> from what I see multiple implementations exist, so it seems like a large
> scope. My hope is to punt on it in this jira and tackle it separately.
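
A quick way to sanity-check the sample repair_validations rows in the comment is to compare duration_millis with the per-state timestamps. In both rows the value equals the gap between state_init_timestamp and state_success_timestamp. That relationship is only an observation from these two rows, not something the ticket states, and the helper name elapsed_millis and the ROWS list are made up for this sketch.

```python
from datetime import datetime, timedelta

# Timestamp layout as printed by cqlsh in the sample output
# (microsecond field, numeric UTC offset).
CQLSH_TS = "%Y-%m-%d %H:%M:%S.%f%z"

# Values copied from the two sample rows; nothing here is queried live.
ROWS = [
    {
        "id": "15183e2e-bdf1-3ba0-b039-72057b2d0317",
        "state_init_timestamp": "2022-01-27 00:34:02.672000+0000",
        "state_success_timestamp": "2022-01-27 00:34:02.708000+0000",
        "duration_millis": 36,
    },
    {
        "id": "730585a5-e105-32f3-a8e4-be6771827819",
        "state_init_timestamp": "2022-01-27 00:34:02.669000+0000",
        "state_success_timestamp": "2022-01-27 00:34:02.706000+0000",
        "duration_millis": 37,
    },
]


def elapsed_millis(start: str, end: str) -> int:
    """Whole milliseconds between two cqlsh-formatted timestamps."""
    delta = datetime.strptime(end, CQLSH_TS) - datetime.strptime(start, CQLSH_TS)
    # Integer division by a 1 ms timedelta avoids float rounding.
    return delta // timedelta(milliseconds=1)


if __name__ == "__main__":
    for row in ROWS:
        computed = elapsed_millis(
            row["state_init_timestamp"], row["state_success_timestamp"]
        )
        matches = computed == row["duration_millis"]
        print(row["id"], computed, row["duration_millis"], matches)
```

If the assumed relationship holds for other rows as well, a mismatch between the two values would be a cheap signal that a row was read while the validation was still in flight.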