Kurt Greaves commented on CASSANDRA-13797:

This issue is actually pretty serious for anyone running vnodes and a mid-sized 
cluster. This isn't the first time we've had unbounded validation compactions 
kicked off simultaneously and it's caused a lot of problems at Instaclustr in 
the past. We should really fix this by 3.11.3, because validations taking up 
all the CPU easily cause massive latency spikes whenever a repair kicks off. 
I'd like a simple revert, but that doesn't fix the issue in the description. 
I don't think this warrants a new ticket, so I think reopening this one is in 
order. [~bdeggleston] [~krummas] WDYT?

> RepairJob blocks on syncTasks
> -----------------------------
>                 Key: CASSANDRA-13797
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13797
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Repair
>            Reporter: Blake Eggleston
>            Assignee: Blake Eggleston
>            Priority: Major
>             Fix For: 3.0.15, 3.11.1, 4.0
> The thread running {{RepairJob}} blocks while it waits for the validations it 
> starts to complete ([see 
> here|https://github.com/bdeggleston/cassandra/blob/9fdec0a82851f5c35cd21d02e8c4da8fc685edb2/src/java/org/apache/cassandra/repair/RepairJob.java#L185]).
>  However, the downstream callbacks (i.e. the post-repair cleanup stuff) aren't 
> waiting for {{RepairJob#run}} to return, they're waiting for a result to be 
> set on the RepairJob future, which happens after the sync tasks have 
> completed. This post-repair cleanup stuff also immediately shuts down the 
> executor {{RepairJob#run}} is running in. So in noop repair sessions, where 
> there's nothing to stream, I'm seeing the callbacks sometimes fire before 
> {{RepairJob#run}} wakes up, causing an {{InterruptedException}} to be thrown.
> I'm pretty sure this can just be removed, but I'd like a second opinion. This 
> appears to just be a holdover from before repair coordination became async. I 
> thought it might be doing some throttling by blocking, but each repair 
> session gets its own executor, and validation is throttled by the fixed 
> size executors doing the actual work of validation, so I don't think we need 
> to keep this around.
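
The race in the description can be reproduced in isolation with plain java.util.concurrent types. What follows is only a minimal sketch, not the Cassandra code: the class and method names are hypothetical, a CompletableFuture stands in for the RepairJob future, and the cleanup callback is assumed to call shutdownNow() on the executor. The blocked wait is modelled with a latch that is never released, which forces the "callback fires before run wakes up" ordering every time instead of only sometimes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the RepairJob / cleanup-callback interaction.
class RepairJobRaceSketch {

    /** Returns true if the blocked "job" thread was interrupted by the cleanup callback. */
    static boolean jobInterruptedByCleanup() {
        // Stands in for the per-session executor that the job's run() executes in.
        ExecutorService jobExecutor = Executors.newSingleThreadExecutor();
        // Stands in for the future that downstream callbacks wait on.
        CompletableFuture<Void> result = new CompletableFuture<>();

        CountDownLatch jobStarted = new CountDownLatch(1);
        CountDownLatch neverReleased = new CountDownLatch(1); // models the blocking wait in run()
        CountDownLatch jobFinished = new CountDownLatch(1);
        AtomicBoolean interrupted = new AtomicBoolean(false);

        // Downstream cleanup: shuts the executor down as soon as the result is set.
        // Assumption: the shutdown interrupts running tasks, as shutdownNow() does.
        result.thenRun(jobExecutor::shutdownNow);

        jobExecutor.execute(() -> {
            jobStarted.countDown();
            try {
                neverReleased.await(); // still blocked (or about to block) when cleanup runs
            } catch (InterruptedException e) {
                interrupted.set(true); // the symptom described in the ticket
            } finally {
                jobFinished.countDown();
            }
        });

        try {
            jobStarted.await();
            // The result is set by something other than run() returning, so the
            // callback fires while the job thread is still waiting.
            result.complete(null);
            jobFinished.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return interrupted.get();
    }

    public static void main(String[] args) {
        System.out.println("job interrupted by cleanup: " + jobInterruptedByCleanup());
    }
}
```

If the job thread did not block at all, there would be nothing left running for the shutdown to interrupt, which is the behaviour the description proposes. The trade-off raised in the comment above is a separate question that this sketch does not model: whether that same blocking was also limiting how many validations run at once.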
