Refactor normal/preview/IR repair to standardize repair cleanup and error handling of failed RepairJobs

David Capwell (Jira) Fri, 12 Nov 2021 11:58:17 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442920#comment-17442920
 ]


David Capwell commented on CASSANDRA-17069:
-------------------------------------------

Here are my results so far with 
org.apache.cassandra.distributed.test.NetstatsBootstrapWithEntireSSTablesCompressionStreamingTest.testWithStreamingEntireSSTablesWithoutCompressionWithoutThrottling

The stack trace points at the following line:

{code}
final Future<AbstractNetstatsStreaming.NetstatResults> netstatsFuture = executorService.submit(new NetstatsCallable(cluster.get(1)));

final AbstractNetstatsStreaming.NetstatResults results = netstatsFuture.get(1, MINUTES); // timeout here
{code}

This future calls nodetool in a loop, sleeping between calls 
(Thread.sleep(500)).  It stops looping once it no longer sees Receiving/Sending 
in the logs (i.e. streaming ran but is no longer running).  After that point it 
waits for the node to come up (2m timeout)...
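
To make the loop described above concrete, here is a minimal sketch of that 
kind of polling.  It is not the actual NetstatsCallable: the class name, the 
method name, and the Supplier standing in for the nodetool call are all 
assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper for illustration; not the NetstatsCallable used by the test. */
final class StreamingPoller
{
    /**
     * Polls the status source until streaming has been seen and is then no longer seen.
     *
     * @param netstats stand-in for the nodetool call (assumption: returns its text output)
     * @return true if streaming ran and then stopped; false on timeout or interruption
     */
    static boolean awaitStreamingFinished(Supplier<String> netstats, long sleepMillis, long timeoutMillis)
    {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        boolean sawStreaming = false;
        while (System.nanoTime() - deadline < 0)
        {
            String output = netstats.get();
            boolean streaming = output.contains("Receiving") || output.contains("Sending");
            if (streaming)
                sawStreaming = true;
            else if (sawStreaming)
                return true; // streaming ran but is no longer running

            try
            {
                Thread.sleep(sleepMillis);
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return false; // gave up: streaming never started or never finished in time
    }
}
```

Returning false on timeout instead of blocking forever lets the caller decide 
how to report the failure, rather than surfacing only as a Future.get timeout.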

I do not believe this patch impacts this test (I ran it locally, and this case 
is hard to hit), but just in case I plan to patch the test (see a flake, fix a 
flake) to wait longer for the node to come up (before it was practically 3m) 
and then check for streaming (if we are not done streaming after node2 is up... 
what happened?).
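
The reordering proposed above could look roughly like the sketch below: wait 
for the node first, then treat "still streaming" as its own failure.  This is 
only an illustration under assumptions; BootstrapCheck, Outcome, and the two 
BooleanSupplier probes are hypothetical names, not part of the test or of the 
in-JVM dtest API.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; the real test patch may look different. */
final class BootstrapCheck
{
    enum Outcome { OK, NODE_NOT_UP, STILL_STREAMING }

    /**
     * Waits up to timeoutMillis for the node to come up, then checks streaming once.
     *
     * @param nodeIsUp    stand-in probe for "node2 is up" (assumption)
     * @param isStreaming stand-in probe for "Receiving/Sending is still reported" (assumption)
     */
    static Outcome check(BooleanSupplier nodeIsUp, BooleanSupplier isStreaming, long pollMillis, long timeoutMillis)
    {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!nodeIsUp.getAsBoolean())
        {
            if (System.nanoTime() - deadline >= 0)
                return Outcome.NODE_NOT_UP;
            try
            {
                Thread.sleep(pollMillis);
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return Outcome.NODE_NOT_UP;
            }
        }
        // The node is up; if streaming is still reported, something unexpected happened.
        return isStreaming.getAsBoolean() ? Outcome.STILL_STREAMING : Outcome.OK;
    }
}
```

Separating the two outcomes makes a flaky run easier to diagnose: a slow node 
start and streaming that never finished show up as different failures.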

> Refactor normal/preview/IR repair to standardize repair cleanup and error 
> handling of failed RepairJobs
> -------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17069
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17069
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Repair
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 4.x
>
>
> Right now we have 3 different implementations of repair: normal, preview, and 
> incremental (IR); all 3 handle RepairJob failures differently and offer 
> different state cleanup.  To make sure that we consistently handle errors and 
> clean up the same way, we should move these responsibilities out of the 
> repair task itself and into common APIs, and move some logic into the repair 
> coordination itself.
> This work relates to CASSANDRA-15399, as special handling for each task makes 
> the work more complex.



