[ 
https://issues.apache.org/jira/browse/CASSANDRA-15399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480257#comment-17480257
 ] 

David Capwell commented on CASSANDRA-15399:
-------------------------------------------

Finally got time to come back to this. I sent out a work-in-progress PR that 
shows coordinator state as well as validation state; streaming is still missing.

Here is sample output from querying the new tables after running a repair:

{code}
node2: SELECT * FROM system_views.repairs
id: 967551d0-7af5-11ec-abdf-517b0d807e7c
duration_millis: 555
failure_cause: null
keyspace_name: distributed_test_keyspace
last_updated_at: Fri Jan 21 12:06:15 PST 2022
participants: [/127.0.0.1:7012]
progress_percentage: 100.0
ranges: [[(-1,9223372036854775805], (9223372036854775805,-1]]]
state: success
state_init_timestamp: Fri Jan 21 12:06:15 PST 2022
state_prepare_complete_timestamp: Fri Jan 21 12:06:15 PST 2022
state_prepare_submit_timestamp: Fri Jan 21 12:06:15 PST 2022
state_repair_complete_timestamp: Fri Jan 21 12:06:15 PST 2022
state_repair_submit_timestamp: Fri Jan 21 12:06:15 PST 2022
state_setup_timestamp: Fri Jan 21 12:06:15 PST 2022
state_started_timestamp: Fri Jan 21 12:06:15 PST 2022
state_success_timestamp: Fri Jan 21 12:06:15 PST 2022
table_names: [simple_preview_sequential_true]
type: preview
unfiltered_ranges: [[(-1,9223372036854775805], (9223372036854775805,-1]]]


node2: SELECT * FROM system_views.repair_sessions
id: 968111a0-7af5-11ec-abdf-517b0d807e7c
duration_millis: 470
failure_cause: null
keyspace_name: distributed_test_keyspace
last_updated_at: Fri Jan 21 12:06:15 PST 2022
progress_percentage: 100.0
ranges: [(-1,9223372036854775805], (9223372036854775805,-1]]
repair_id: 967551d0-7af5-11ec-abdf-517b0d807e7c
state: success
state_failure_timestamp: null
state_init_timestamp: Fri Jan 21 12:06:15 PST 2022
state_jobs_submit_timestamp: Fri Jan 21 12:06:15 PST 2022
state_skipped_timestamp: null
state_started_timestamp: Fri Jan 21 12:06:15 PST 2022
state_success_timestamp: Fri Jan 21 12:06:15 PST 2022
table_names: [simple_preview_sequential_true]


node2: SELECT * FROM system_views.repair_jobs
id: 2e8c3999-04b0-3013-a351-f8a43283c7c2
duration_millis: 457
failure_cause: null
keyspace_name: distributed_test_keyspace
last_updated_at: Fri Jan 21 12:06:15 PST 2022
progress_percentage: 100.0
ranges: [(-1,9223372036854775805], (9223372036854775805,-1]]
repair_id: 967551d0-7af5-11ec-abdf-517b0d807e7c
session_id: 968111a0-7af5-11ec-abdf-517b0d807e7c
state: success
state_failure_timestamp: null
state_init_timestamp: Fri Jan 21 12:06:15 PST 2022
state_snapshot_complete_timestamp: Fri Jan 21 12:06:15 PST 2022
state_snapshot_submit_timestamp: Fri Jan 21 12:06:15 PST 2022
state_started_timestamp: Fri Jan 21 12:06:15 PST 2022
state_stream_submit_timestamp: Fri Jan 21 12:06:15 PST 2022
state_success_timestamp: Fri Jan 21 12:06:15 PST 2022
state_validation_complete_timestamp: Fri Jan 21 12:06:15 PST 2022
state_validation_submit_timestamp: Fri Jan 21 12:06:15 PST 2022
table_name: simple_preview_sequential_true


node1: SELECT * FROM system_views.repair_validations
id: 2e8c3999-04b0-3013-a351-f8a43283c7c2
duration_millis: 83
estimated_partitions: 129
estimated_total_bytes: 28
failure_cause: null
initiator: /127.0.0.2:7012
keyspace_name: distributed_test_keyspace
last_updated_at: Fri Jan 21 12:06:15 PST 2022
partitions_processed: 1
progress_percentage: 100.0
ranges: [(-1,9223372036854775805], (9223372036854775805,-1]]
repair_id: 967551d0-7af5-11ec-abdf-517b0d807e7c
session_id: 968111a0-7af5-11ec-abdf-517b0d807e7c
state: success
state_failure_timestamp: null
state_init_timestamp: Fri Jan 21 12:06:15 PST 2022
state_sending_trees_timestamp: Fri Jan 21 12:06:15 PST 2022
state_skipped_timestamp: null
state_started_timestamp: Fri Jan 21 12:06:15 PST 2022
state_success_timestamp: Fri Jan 21 12:06:15 PST 2022
table_name: simple_preview_sequential_true
{code}
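
The ids in the output above link the tables together: repair_sessions.repair_id matches repairs.id, and session_id in repair_jobs and repair_validations matches repair_sessions.id. Below is a minimal Python sketch of following those links from a repair down to its validations. It assumes the "key: value" dump format shown above with one row per query; parse_dump and validations_for_repair are hypothetical helper names, not part of Cassandra or the PR.

```python
# Sketch only: parses the "key: value" dump format shown above and follows
# the id links between tables. Assumes one row per "SELECT" header line.

def parse_dump(text):
    """Return {table_name: [row_dict, ...]} from a dump like the one above."""
    tables = {}
    row = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(": ")
        if value.startswith("SELECT * FROM "):
            # Header such as "node2: SELECT * FROM system_views.repairs"
            table = value.rsplit(".", 1)[-1]
            row = {"_node": key}  # remember which node answered the query
            tables.setdefault(table, []).append(row)
        elif row is not None:
            row[key] = None if value == "null" else value
    return tables


def validations_for_repair(tables, repair_id):
    """Follow repairs.id -> repair_sessions.repair_id -> validations.session_id."""
    session_ids = {
        s["id"]
        for s in tables.get("repair_sessions", [])
        if s.get("repair_id") == repair_id
    }
    return [
        v
        for v in tables.get("repair_validations", [])
        if v.get("session_id") in session_ids
    ]


# Trimmed copy of the sample output, keeping only the linking columns.
sample = """
node2: SELECT * FROM system_views.repairs
id: 967551d0-7af5-11ec-abdf-517b0d807e7c
state: success

node2: SELECT * FROM system_views.repair_sessions
id: 968111a0-7af5-11ec-abdf-517b0d807e7c
repair_id: 967551d0-7af5-11ec-abdf-517b0d807e7c
state: success

node1: SELECT * FROM system_views.repair_validations
id: 2e8c3999-04b0-3013-a351-f8a43283c7c2
session_id: 968111a0-7af5-11ec-abdf-517b0d807e7c
state: success
"""

tables = parse_dump(sample)
found = validations_for_repair(tables, "967551d0-7af5-11ec-abdf-517b0d807e7c")
print([(v["_node"], v["id"], v["state"]) for v in found])
```

Note that the coordinator tables were read from node2 while the validation row came from node1, so in practice the rows would have to be collected from each participant before linking them.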

> Add ability to track state in repair
> ------------------------------------
>
>                 Key: CASSANDRA-15399
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15399
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Repair
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> To improve visibility into repair, we should expose internal state via 
> virtual tables; the state should include coordinator as well as participant 
> state (validation, sync, etc.)
> I propose the following tables:
> repairs - high level summary of the global state of repair; this should be 
> called on the coordinator.
> {code:sql}
> CREATE TABLE repairs (
>   id uuid,
>   keyspace_name text,
>   table_names frozen<list<text>>,
>   ranges frozen<list<text>>,
>   coordinator text,
>   participants frozen<list<text>>,
>   state text,
>   progress_percentage float,
>   last_updated_at_millis bigint,
>   duration_micro bigint,
>   failure_cause text,
>   PRIMARY KEY ( (id) )
> )
> {code}
> repair_tasks - represents RepairJob and participants state.  This will show 
> if validations are running on participants and the progress they are making; 
> this should be called on the coordinator.
> {code:sql}
> CREATE TABLE repair_tasks (
>   id uuid,
>   session_id uuid,
>   keyspace_name text,
>   table_name text,
>   ranges frozen<list<text>>,
>   coordinator text,
>   participant text,
>   state text,
>   state_description text,
>   progress_percentage float, -- between 0.0 and 100.0
>   last_updated_at_millis bigint,
>   duration_micro bigint,
>   failure_cause text,
>   PRIMARY KEY ( (id), session_id, table_name, participant )
> )
> {code}
> repair_validations - shows the state of the validation task and is updated 
> periodically while validation is running; this should be called on the 
> participants.
> {code:sql}
> CREATE TABLE repair_validations (
>   id uuid,
>   session_id uuid,
>   ranges frozen<list<text>>,
>   keyspace_name text,
>   table_name text,
>   initiator text,
>   state text,
>   progress_percentage float,
>   queue_duration_ms bigint,
>   runtime_duration_ms bigint,
>   total_duration_ms bigint,
>   estimated_partitions bigint,
>   partitions_processed bigint,
>   estimated_total_bytes bigint,
>   failure_cause text,
>   PRIMARY KEY ( (id), session_id, table_name )
> )
> {code}
> The main reason for exposing virtual tables rather than durable tables is to 
> make sure what is exposed is accurate.  In cases of write failures or node 
> failures, durable tables could become inaccurate and add edge cases where the 
> repair is not running but the tables say it is; by relying on repair's 
> internal in-memory bookkeeping, these problems go away.
> This jira does not try to solve the following:
> 1) repair resiliency - there are edge cases where repair hits an error and 
> runs forever (at least from nodetool's perspective).
> 2) repair stream tracking - I have not learned the streaming side yet, and 
> from what I see multiple implementations exist, so this seems like a large 
> scope.  My hope is to punt it from this jira and tackle it separately.
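
The argument for in-memory bookkeeping in the description, together with the state_*_timestamp columns in the sample output, can be illustrated with a small sketch. This is only an assumption about the general idea, written in Python for brevity; StateTracker, enter, and as_row are hypothetical names and do not reflect the actual PR code.

```python
import time


class StateTracker:
    """Hypothetical in-memory tracker: records when each state was first entered."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._timestamps = {}  # state name -> seconds since the epoch
        self.state = None
        self.enter("init")

    def enter(self, state):
        # Keep the first time a state was reached; later re-entries do not overwrite it.
        self._timestamps.setdefault(state, self._clock())
        self.state = state

    def as_row(self):
        # The row is built on demand from live state, so nothing is persisted
        # that could go stale after a write failure or node failure.
        row = {"state": self.state}
        for name, ts in self._timestamps.items():
            row[f"state_{name}_timestamp"] = ts
        return row


tracker = StateTracker()
tracker.enter("started")
tracker.enter("success")
print(tracker.as_row())
```

Because the row is computed at read time from the object that drives the work, a reader never sees a "running" entry for a task that no longer exists in memory.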



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
