[jira] [Commented] (CASSANDRA-15399) Add ability to track state in repair

David Capwell (Jira) Wed, 26 Jan 2022 16:42:09 -0800


    [ https://issues.apache.org/jira/browse/CASSANDRA-15399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17482815#comment-17482815 ]


David Capwell commented on CASSANDRA-15399:
-------------------------------------------

{code}
cqlsh:ks> select * from system_views.repair_validations where repair_id=d373a480-7f08-11ec-b73e-87a2fc152add ALLOW FILTERING;

@ Row 1
-------------------------------+------------------------------------------------
 id                            | 15183e2e-bdf1-3ba0-b039-72057b2d0317
 duration_millis               | 36
 estimated_partitions          | 128
 estimated_total_bytes         | 27
 failure_cause                 | null
 initiator                     | /127.0.0.1:7000
 keyspace_name                 | ks
 last_updated_at               | 2022-01-27 00:34:02.708000+0000
 partitions_processed          | 1
 progress_percentage           | 100
 ranges                        | ['(-3074457345618258603,3074457345618258602]']
 repair_id                     | d373a480-7f08-11ec-b73e-87a2fc152add
 session_id                    | d3822370-7f08-11ec-b73e-87a2fc152add
 state                         | success
 state_failure_timestamp       | null
 state_init_timestamp          | 2022-01-27 00:34:02.672000+0000
 state_sending_trees_timestamp | 2022-01-27 00:34:02.705000+0000
 state_skipped_timestamp       | null
 state_started_timestamp       | 2022-01-27 00:34:02.678000+0000
 state_success_timestamp       | 2022-01-27 00:34:02.708000+0000
 success_message               | null
 table_name                    | users

@ Row 2
-------------------------------+------------------------------------------------
 id                            | 730585a5-e105-32f3-a8e4-be6771827819
 duration_millis               | 37
 estimated_partitions          | 1
 estimated_total_bytes         | 26
 failure_cause                 | null
 initiator                     | /127.0.0.1:7000
 keyspace_name                 | ks
 last_updated_at               | 2022-01-27 00:34:02.706000+0000
 partitions_processed          | 1
 progress_percentage           | 100
 ranges                        | ['(3074457345618258602,-9223372036854775808]']
 repair_id                     | d373a480-7f08-11ec-b73e-87a2fc152add
 session_id                    | d38027a0-7f08-11ec-b73e-87a2fc152add
 state                         | success
 state_failure_timestamp       | null
 state_init_timestamp          | 2022-01-27 00:34:02.669000+0000
 state_sending_trees_timestamp | 2022-01-27 00:34:02.703000+0000
 state_skipped_timestamp       | null
 state_started_timestamp       | 2022-01-27 00:34:02.676000+0000
 state_success_timestamp       | 2022-01-27 00:34:02.706000+0000
 success_message               | null
 table_name                    | users

(2 rows)
{code}
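
In both rows above, duration_millis appears to equal the gap between state_init_timestamp and state_success_timestamp (36 ms and 37 ms). The sketch below checks that for Row 1 and breaks the total down per state transition. It is only an illustration: the helper names (millis_between, phase_durations) and the assumed state order are hypothetical, and the values are copied from the output rather than read from a live cluster.

```python
from datetime import datetime, timedelta

# Values copied from Row 1 of the cqlsh output above.
row = {
    "duration_millis": 36,
    "state_init_timestamp": "2022-01-27 00:34:02.672000+0000",
    "state_started_timestamp": "2022-01-27 00:34:02.678000+0000",
    "state_sending_trees_timestamp": "2022-01-27 00:34:02.705000+0000",
    "state_success_timestamp": "2022-01-27 00:34:02.708000+0000",
    "state_failure_timestamp": None,
    "state_skipped_timestamp": None,
}

# cqlsh timestamp layout as shown in the output.
FMT = "%Y-%m-%d %H:%M:%S.%f%z"

# Assumed order of the states a successful validation passes through.
STATE_ORDER = ["init", "started", "sending_trees", "success"]


def millis_between(start, end):
    """Whole milliseconds between two cqlsh-formatted timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta // timedelta(milliseconds=1)


def phase_durations(row):
    """Milliseconds spent between consecutive states that were reached."""
    reached = [
        (state, row[f"state_{state}_timestamp"])
        for state in STATE_ORDER
        if row.get(f"state_{state}_timestamp")
    ]
    return {
        f"{a}->{b}": millis_between(ta, tb)
        for (a, ta), (b, tb) in zip(reached, reached[1:])
    }


phases = phase_durations(row)
print(phases)
print(sum(phases.values()) == row["duration_millis"])
```

For Row 1 this gives 6 ms from init to started, 27 ms from started to sending_trees, and 3 ms from sending_trees to success, which sums to the reported 36 ms.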

> Add ability to track state in repair
> ------------------------------------
>
>                 Key: CASSANDRA-15399
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15399
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Repair
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> To improve visibility into repair, we should expose its internal state via 
> virtual tables; the state should include coordinator as well as participant 
> state (validation, sync, etc.).
> I propose the following tables:
> repairs - high-level summary of the global state of repair; this should be 
> queried on the coordinator.
> {code:sql}
> CREATE TABLE repairs (
>   id uuid,
>   keyspace_name text,
>   table_names frozen<list<text>>,
>   ranges frozen<list<text>>,
>   coordinator text,
>   participants frozen<list<text>>,
>   state text,
>   progress_percentage float,
>   last_updated_at_millis bigint,
>   duration_micro bigint,
>   failure_cause text,
>   PRIMARY KEY ( (id) )
> )
> {code}
> repair_tasks - represents RepairJob and participant state.  This shows 
> whether validations are running on participants and the progress they are 
> making; this should be queried on the coordinator.
> {code:sql}
> CREATE TABLE repair_tasks (
>   id uuid,
>   session_id uuid,
>   keyspace_name text,
>   table_name text,
>   ranges frozen<list<text>>,
>   coordinator text,
>   participant text,
>   state text,
>   state_description text,
>   progress_percentage float, -- between 0.0 and 100.0
>   last_updated_at_millis bigint,
>   duration_micro bigint,
>   failure_cause text,
>   PRIMARY KEY ( (id), session_id, table_name, participant )
> )
> {code}
> repair_validations - shows the state of the validation task and is updated 
> periodically while validation is running; this should be queried on the 
> participants.
> {code:sql}
> CREATE TABLE repair_validations (
>   id uuid,
>   session_id uuid,
>   ranges frozen<list<text>>,
>   keyspace_name text,
>   table_name text,
>   initiator text,
>   state text,
>   progress_percentage float,
>   queue_duration_ms bigint,
>   runtime_duration_ms bigint,
>   total_duration_ms bigint,
>   estimated_partitions bigint,
>   partitions_processed bigint,
>   estimated_total_bytes bigint,
>   failure_cause text,
>   PRIMARY KEY ( (id), session_id, table_name )
> )
> {code}
> The main reason for exposing virtual tables rather than durable tables is 
> to make sure that what is exposed is accurate.  In cases of write failures 
> or node failures, durable tables could become inaccurate and add edge cases 
> where the repair is not running but the tables say it is; by relying on 
> repair's internal in-memory bookkeeping, these problems go away.
> This jira does not try to solve the following:
> 1) repair resiliency - there are edge cases where repair hits an error and 
> runs forever (at least from nodetool's perspective).
> 2) repair stream tracking - I have not learned the streaming side yet, and 
> from what I see multiple implementations exist, so this seems like a large 
> scope.  My hope is to punt it from this jira and tackle it separately.
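
The rationale in the quoted description (serve reads from repair's in-memory bookkeeping instead of a durable table) can be sketched roughly as follows. This is a minimal illustration, not the Cassandra implementation: the ValidationState class, its methods, and the injectable clock are hypothetical, and only the state names are taken from the state_*_timestamp columns of repair_validations. Because every read builds the row from the live object, there is no separately written copy that can go stale after a write or node failure.

```python
import time

# State names mirror the state_*_timestamp columns of repair_validations;
# everything else in this sketch is hypothetical.
STATES = ("init", "started", "sending_trees", "success", "failure", "skipped")


class ValidationState:
    """In-memory record of one validation task (illustrative only)."""

    def __init__(self, estimated_partitions, clock=time.time):
        self._clock = clock
        self.estimated_partitions = estimated_partitions
        self.partitions_processed = 0
        self.failure_cause = None
        self.timestamps = {}
        self._enter("init")

    def _enter(self, state):
        """Record a state transition together with the time it happened."""
        if state not in STATES:
            raise ValueError(f"unknown state: {state}")
        self.state = state
        self.timestamps[state] = self.last_updated_at = self._clock()

    def start(self):
        self._enter("started")

    def partition_processed(self):
        self.partitions_processed += 1
        self.last_updated_at = self._clock()

    def succeed(self):
        self._enter("success")

    def fail(self, cause):
        self.failure_cause = cause
        self._enter("failure")

    def progress_percentage(self):
        if self.state == "success":
            return 100.0
        if self.estimated_partitions <= 0:
            return 0.0
        done = 100.0 * self.partitions_processed / self.estimated_partitions
        return min(100.0, done)

    def snapshot(self):
        """Build the row a virtual table read would return, at read time."""
        row = {
            "state": self.state,
            "progress_percentage": self.progress_percentage(),
            "estimated_partitions": self.estimated_partitions,
            "partitions_processed": self.partitions_processed,
            "failure_cause": self.failure_cause,
            "last_updated_at": self.last_updated_at,
        }
        for state in STATES:
            row[f"state_{state}_timestamp"] = self.timestamps.get(state)
        return row


if __name__ == "__main__":
    validation = ValidationState(estimated_partitions=4)
    validation.start()
    validation.partition_processed()
    print(validation.snapshot())
```

States that were never reached simply have no timestamp, which matches the null state_failure_timestamp and state_skipped_timestamp values in a successful row.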



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
