[ https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212949#comment-17212949 ]
Alexander Dejanovski commented on CASSANDRA-15580:
--------------------------------------------------
Here's a test plan proposal:
Generate/restore a workload of ~100 GB to 200 GB per node.
Some SSTables will have to be deleted (randomly?) so that repair actually goes
through streaming sessions (a rough sketch of that setup follows below).
Perform repairs on a 3-node cluster with 4 cores and 16 GB of RAM per node.
Repaired keyspaces will use RF=3, or RF=2 in some cases (the latter is for
subranges with different sets of replicas).
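As a minimal sketch of that setup (the stress parameters, keyspace/table names
and data directory layout are assumptions, not part of the plan):
{code:python}
# Sketch of the workload step: write data with cassandra-stress, then delete a
# random subset of SSTables so the next repair has to stream data.
# Paths and parameters are assumptions and must match the actual test cluster.
import glob
import os
import random
import subprocess

DATA_GLOB = "/var/lib/cassandra/data/keyspace1/standard1-*/*-Data.db"  # assumed layout

def generate_workload(ops: int = 50_000_000) -> None:
    # Write enough data to land in the ~100-200 GB per node range, RF=3.
    subprocess.run(
        ["cassandra-stress", "write", f"n={ops}",
         "-schema", "replication(factor=3)",
         "-rate", "threads=100"],
        check=True,
    )

def drop_random_sstables(fraction: float = 0.05) -> None:
    # Delete a random subset of SSTables (all of their components) so repair
    # detects missing data. The node must be stopped before doing this.
    data_files = glob.glob(DATA_GLOB)
    for data_file in random.sample(data_files, int(len(data_files) * fraction)):
        prefix = data_file[: -len("Data.db")]
        for component in glob.glob(prefix + "*"):
            os.remove(component)
{code}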
|| Mode || Version || Settings || Checks ||
| Full repair | trunk | Sequential + all token ranges | No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out of sync range |
| Full repair | trunk | Parallel + primary range | No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out of sync range |
| Full repair | trunk | Force terminate repair shortly after it was triggered | Repair threads must be cleaned up |
| Full repair | Mixed trunk + latest 3.11.x | Sequential + all token ranges | Repair should fail |
| Subrange repair | trunk | Sequential + single token range | No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out of sync range |
| Subrange repair | trunk | Parallel + 10 token ranges which have the same replicas | No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out of sync range; repair sessions must be cleaned up after a force terminate |
| Subrange repair | trunk | Parallel + 10 token ranges which have different replicas | No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out of sync range; repair sessions must be cleaned up after a force terminate |
| Subrange repair | trunk | Single token range; force terminate repair shortly after it was triggered | Repair threads must be cleaned up |
| Subrange repair | Mixed trunk + latest 3.11.x | Sequential + single token range | Repair should fail |
| Incremental repair | trunk | Parallel (mandatory); no compaction during repair | Anticompaction status (repairedAt != 0) on all SSTables; no pending repair on SSTables after completion; out of sync ranges > 0; subsequent run must show no out of sync range |
| Incremental repair | trunk | Parallel (mandatory); major compaction triggered during repair | Anticompaction status (repairedAt != 0) on all SSTables; no pending repair on SSTables after completion; out of sync ranges > 0; subsequent run must show no out of sync range |
| Incremental repair | trunk | Force terminate repair shortly after it was triggered | Repair threads must be cleaned up |
| Incremental repair | Mixed trunk + latest 3.11.x | Parallel | Repair should fail |
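To make the checks column concrete, here's a rough sketch of how the first row
(full repair, sequential, all token ranges) could be verified from a driver
script; the keyspace/table names and data path are placeholders:
{code:python}
# Run a full sequential repair, then assert that no SSTable was marked
# repaired (repairedAt == 0), i.e. no anticompaction happened.
import glob
import re
import subprocess

KEYSPACE, TABLE = "keyspace1", "standard1"   # assumed test schema
DATA_GLOB = f"/var/lib/cassandra/data/{KEYSPACE}/{TABLE}-*/*-Data.db"

def run_full_sequential_repair() -> str:
    # --full disables incremental mode, -seq requests sequential validation.
    return subprocess.run(
        ["nodetool", "repair", "--full", "-seq", KEYSPACE, TABLE],
        capture_output=True, text=True, check=True,
    ).stdout

def assert_no_anticompaction() -> None:
    # Full repairs must leave repairedAt at 0 on every SSTable.
    for data_file in glob.glob(DATA_GLOB):
        out = subprocess.run(
            ["sstablemetadata", data_file],
            capture_output=True, text=True, check=True,
        ).stdout
        match = re.search(r"Repaired at:\s*(\d+)", out)
        assert match and match.group(1) == "0", f"{data_file} was anticompacted"

if __name__ == "__main__":
    print(run_full_sequential_repair())  # first run should report out-of-sync ranges
    assert_no_anticompaction()
{code}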
I'm not sure about fuzz testing repair though. Repair is not a resilient
process and isn't designed to be one: resiliency is obtained through third-party
tools that reschedule failed repairs. If a node that should take part in a
repair session is down or goes down, the repair session will simply fail AFAIK
(see the sketch below).
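Quick illustration of that point, purely a sketch using ccm with assumed node
and keyspace names:
{code:python}
# With one replica down, a repair covering its ranges should fail rather than
# be retried internally; scheduling tools are expected to handle the retry.
import subprocess

subprocess.run(["ccm", "node2", "stop"], check=True)   # take one replica down
result = subprocess.run(
    ["ccm", "node1", "nodetool", "repair", "--full", "keyspace1"],
    capture_output=True, text=True,
)
# Expect a non-zero exit status: repair is not resilient to down replicas.
assert result.returncode != 0, "repair unexpectedly succeeded with a node down"
{code}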
The mixed version tests could be challenging to set up, as we probably don't
want to pin a specific version as being the "previous" one.
Should this test always run against trunk and the previous major version? On a
major version bump (when trunk moves to 5.0), I'd expect the test to pass since
repair will probably keep working for a while, unless there's a check on
version numbers during repair/streaming?
> 4.0 quality testing: Repair
> ---------------------------
>
> Key: CASSANDRA-15580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15580
> Project: Cassandra
> Issue Type: Task
> Components: Test/dtest/python
> Reporter: Josh McKenzie
> Assignee: Alexander Dejanovski
> Priority: Normal
> Fix For: 4.0-rc
>
>
> Reference [doc from NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] for context.
> *Shepherd: Alexander Dejanovski*
> We aim for 4.0 to have the first fully functioning incremental repair
> solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of
> repair: (full range, sub range, incremental) function as expected as well as
> ensuring community tools such as Reaper work. CASSANDRA-3200 adds an
> experimental option to reduce the amount of data streamed during repair, we
> should write more tests and see how it works with big nodes.