[ https://issues.apache.org/jira/browse/CASSANDRA-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212949#comment-17212949 ]
Alexander Dejanovski commented on CASSANDRA-15580:
--------------------------------------------------
Here's a test plan proposal:
Generate/restore a workload of ~100 GB to 200 GB per node.
Some SSTables will have to be deleted (randomly?) so that repair actually goes
through streaming sessions (a rough sketch of that setup follows below).
Perform repairs on a 3-node cluster with 4 cores and 16 GB of RAM per node.
Repaired keyspaces will use RF=3, or RF=2 in some cases (the latter is for
subranges with different sets of replicas).
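As a minimal sketch of that setup (the stress parameters, keyspace/table names
and data directory layout are assumptions, not part of the plan):
{code:python}
# Sketch of the workload step: write data with cassandra-stress, then delete a
# random subset of SSTables so the next repair has to stream data.
# Paths and parameters are assumptions and must match the actual test cluster.
import glob
import os
import random
import subprocess

DATA_GLOB = "/var/lib/cassandra/data/keyspace1/standard1-*/*-Data.db"  # assumed layout

def generate_workload(ops: int = 50_000_000) -> None:
    # Write enough data to land in the ~100-200 GB per node range, RF=3.
    subprocess.run(
        ["cassandra-stress", "write", f"n={ops}",
         "-schema", "replication(factor=3)",
         "-rate", "threads=100"],
        check=True,
    )

def drop_random_sstables(fraction: float = 0.05) -> None:
    # Delete a random subset of SSTables (all of their components) so repair
    # detects missing data. The node must be stopped before doing this.
    data_files = glob.glob(DATA_GLOB)
    for data_file in random.sample(data_files, int(len(data_files) * fraction)):
        prefix = data_file[: -len("Data.db")]
        for component in glob.glob(prefix + "*"):
            os.remove(component)
{code}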
|| Mode || Version || Settings || Checks ||
| Full repair | trunk | Sequential + all token ranges | No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out of sync range |
| Full repair | trunk | Parallel + primary range | No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out of sync range |
| Full repair | trunk | Force terminate repair shortly after it was triggered | Repair threads must be cleaned up |
| Full repair | Mixed trunk + latest 3.11.x | Sequential + all token ranges | Repair should fail |
| Subrange repair | trunk | Sequential + single token range | No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out of sync range |
| Subrange repair | trunk | Parallel + 10 token ranges which have the same replicas | No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out of sync range; repair sessions must be cleaned up after a force terminate |
| Subrange repair | trunk | Parallel + 10 token ranges which have different replicas | No anticompaction (repairedAt == 0); out of sync ranges > 0; subsequent run must show no out of sync range; repair sessions must be cleaned up after a force terminate |
| Subrange repair | trunk | Single token range; force terminate repair shortly after it was triggered | Repair threads must be cleaned up |
| Subrange repair | Mixed trunk + latest 3.11.x | Sequential + single token range | Repair should fail |
| Incremental repair | trunk | Parallel (mandatory); no compaction during repair | Anticompaction status (repairedAt != 0) on all SSTables; no pending repair on SSTables after completion; out of sync ranges > 0; subsequent run must show no out of sync range |
| Incremental repair | trunk | Parallel (mandatory); major compaction triggered during repair | Anticompaction status (repairedAt != 0) on all SSTables; no pending repair on SSTables after completion; out of sync ranges > 0; subsequent run must show no out of sync range |
| Incremental repair | trunk | Force terminate repair shortly after it was triggered | Repair threads must be cleaned up |
| Incremental repair | Mixed trunk + latest 3.11.x | Parallel | Repair should fail |
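To make the checks column concrete, here's a rough sketch of how the first row
(full repair, sequential, all token ranges) could be verified from a driver
script; the keyspace/table names and data path are placeholders:
{code:python}
# Run a full sequential repair, then assert that no SSTable was marked
# repaired (repairedAt == 0), i.e. no anticompaction happened.
import glob
import re
import subprocess

KEYSPACE, TABLE = "keyspace1", "standard1"   # assumed test schema
DATA_GLOB = f"/var/lib/cassandra/data/{KEYSPACE}/{TABLE}-*/*-Data.db"

def run_full_sequential_repair() -> str:
    # --full disables incremental mode, -seq requests sequential validation.
    return subprocess.run(
        ["nodetool", "repair", "--full", "-seq", KEYSPACE, TABLE],
        capture_output=True, text=True, check=True,
    ).stdout

def assert_no_anticompaction() -> None:
    # Full repairs must leave repairedAt at 0 on every SSTable.
    for data_file in glob.glob(DATA_GLOB):
        out = subprocess.run(
            ["sstablemetadata", data_file],
            capture_output=True, text=True, check=True,
        ).stdout
        match = re.search(r"Repaired at:\s*(\d+)", out)
        assert match and match.group(1) == "0", f"{data_file} was anticompacted"

if __name__ == "__main__":
    print(run_full_sequential_repair())  # first run should report out-of-sync ranges
    assert_no_anticompaction()
{code}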
I'm not sure about fuzz testing repair though. Repair is not a resilient
process and isn't designed to be one: resiliency is obtained through third-party
tools that reschedule failed repairs. If a node that should take part in a
repair session is down or goes down, the repair session will simply fail AFAIK
(see the sketch below).
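Quick illustration of that point, purely a sketch using ccm with assumed node
and keyspace names:
{code:python}
# With one replica down, a repair covering its ranges should fail rather than
# be retried internally; scheduling tools are expected to handle the retry.
import subprocess

subprocess.run(["ccm", "node2", "stop"], check=True)   # take one replica down
result = subprocess.run(
    ["ccm", "node1", "nodetool", "repair", "--full", "keyspace1"],
    capture_output=True, text=True,
)
# Expect a non-zero exit status: repair is not resilient to down replicas.
assert result.returncode != 0, "repair unexpectedly succeeded with a node down"
{code}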
The mixed version tests could be challenging to set up, as we probably don't
want to pin a specific version as being the "previous" one.
Should this test always run against trunk and the previous major version? On a
major version bump (when trunk moves to 5.0), I'd expect the test to pass since
repair will probably keep working for a while, unless there's a check on
version numbers during repair/streaming?
> 4.0 quality testing: Repair
> ---------------------------
>
> Key: CASSANDRA-15580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15580
> Project: Cassandra
> Issue Type: Task
> Components: Test/dtest/python
> Reporter: Josh McKenzie
> Assignee: Alexander Dejanovski
> Priority: Normal
> Fix For: 4.0-rc
>
>
> Reference [doc from NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] for context.
> *Shepherd: Alexander Dejanovski*
> We aim for 4.0 to have the first fully functioning incremental repair
> solution (CASSANDRA-9143)! Furthermore we aim to verify that all types of
> repair: (full range, sub range, incremental) function as expected as well as
> ensuring community tools such as Reaper work. CASSANDRA-3200 adds an
> experimental option to reduce the amount of data streamed during repair, we
> should write more tests and see how it works with big nodes.