[ https://issues.apache.org/jira/browse/CASSANDRA-13797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383119#comment-16383119 ]
Vincent White commented on CASSANDRA-13797:
-------------------------------------------

After upgrading an ~18 node, vnode, multi-DC cluster from 3.11.0 to 3.11.1, some nodes started running hundreds of concurrent validation compactions. After rolling back, they returned to 1 concurrent validation per CF. I haven't had a chance to reproduce it at that scale, but my local testing shows that if I have enough data, or just add a sleep(9999999) to validation compactions to simulate long validations, they continue to accumulate over a few seconds until the repair session has looped through all the common ranges.

> RepairJob blocks on syncTasks
> -----------------------------
>
>                 Key: CASSANDRA-13797
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13797
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Repair
>            Reporter: Blake Eggleston
>            Assignee: Blake Eggleston
>            Priority: Major
>             Fix For: 3.0.15, 3.11.1, 4.0
>
>
> The thread running {{RepairJob}} blocks while it waits for the validations it
> starts to complete ([see
> here|https://github.com/bdeggleston/cassandra/blob/9fdec0a82851f5c35cd21d02e8c4da8fc685edb2/src/java/org/apache/cassandra/repair/RepairJob.java#L185]).
> However, the downstream callbacks (i.e. the post-repair cleanup stuff) aren't
> waiting for {{RepairJob#run}} to return; they're waiting for a result to be
> set on the {{RepairJob}} future, which happens after the sync tasks have
> completed. This post-repair cleanup also immediately shuts down the
> executor {{RepairJob#run}} is running in. So in noop repair sessions, where
> there's nothing to stream, I'm seeing the callbacks sometimes fire before
> {{RepairJob#run}} wakes up, causing an {{InterruptedException}} to be thrown.
> I'm pretty sure this can just be removed, but I'd like a second opinion. This
> appears to just be a holdover from before repair coordination became async. I
> thought it might be doing some throttling by blocking, but each repair
> session gets its own executor, and validation is throttled by the fixed
> size executors doing the actual work of validation, so I don't think we need
> to keep this around.
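
For readers unfamiliar with the race described in the issue, here is a minimal, self-contained sketch of the general pattern. It is not Cassandra code: the class name, the latches, and the use of shutdownNow() are all assumptions made for illustration. It only shows that a task still blocked inside an executor receives an InterruptedException if something else shuts that executor down before the task wakes up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class; none of these names come from Cassandra.
public class BlockedTaskInterruptDemo {

    // Returns true if the blocked task observed an InterruptedException.
    static boolean blockedTaskIsInterrupted() throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        // Stands in for the blocking wait inside the task; it is never released.
        CountDownLatch neverReleased = new CountDownLatch(1);
        AtomicBoolean interrupted = new AtomicBoolean(false);

        executor.execute(() -> {
            started.countDown();
            try {
                neverReleased.await(); // the task is still blocked here...
            } catch (InterruptedException e) {
                interrupted.set(true); // ...when the executor is shut down under it
            }
        });

        // Stands in for the cleanup callback firing before the task wakes up.
        // Assumption: "immediately shuts down" is modeled with shutdownNow().
        started.await();
        executor.shutdownNow();
        executor.awaitTermination(5, TimeUnit.SECONDS);
        return interrupted.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("interrupted = " + blockedTaskIsInterrupted());
    }
}
```

In this sketch the interrupt goes away if the task never blocks, which mirrors the fix proposed in the issue. The sketch says nothing about the throttling side effect discussed in the comment above.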