[
https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294182#comment-15294182
]
Paulo Motta commented on CASSANDRA-11824:
-----------------------------------------
Overall +1, just some minor nits:
* I think always registering with the FD/Gossiper at {{ActiveRepairService}}
construction will make the code simpler than registering only on the first
submitted session and tracking that in the {{registeredForEndpointChanges}}
variable. The penalty will be negligible when no repair is running, and we
keep less state in {{ActiveRepairService}}.
* To make the dtest more deterministic, instead of sleeping for {{3 seconds}},
can you use {{watch_log_for("Requesting merkle trees for")}}? We could maybe
also check for {{"Removing .* in parent repair sessions"}} in the logs of the
participant nodes, to make sure the FD is killing the repair session. On my box,
for example, the coordinator was killed before any message was sent to the
participants.
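
To illustrate the second nit, here is a minimal, self-contained sketch of the
"wait for a log line instead of sleeping" idea. The helper name
{{wait_for_log_line}}, its parameters and the log path are hypothetical and only
for illustration; this is not the dtest code. In the dtest itself the
{{watch_log_for}} call mentioned above would play this role.

```python
import re
import time


def wait_for_log_line(path, pattern, timeout=30.0, poll_interval=0.1):
    """Block until a complete line matching `pattern` appears in `path`.

    Hypothetical helper, for illustration only. Returns the first matching
    line (without its newline) or raises TimeoutError.
    """
    regex = re.compile(pattern)
    deadline = time.monotonic() + timeout
    offset = 0  # byte offset just past the last complete line already scanned
    while True:
        try:
            with open(path, "rb") as log:
                log.seek(offset)
                for raw in iter(log.readline, b""):
                    if not raw.endswith(b"\n"):
                        break  # partially written line: re-read on next poll
                    offset += len(raw)
                    line = raw.decode("utf-8", errors="replace")
                    if regex.search(line):
                        return line.rstrip("\n")
        except FileNotFoundError:
            pass  # the log file may not have been created yet
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "no line matching %r in %s after %.1fs" % (pattern, path, timeout)
            )
        time.sleep(poll_interval)
```

With something like this, the test proceeds as soon as the expected message
shows up, and fails with a clear timeout if it never does, rather than
depending on a fixed sleep being long enough on every machine.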
> If repair fails no way to run repair again
> ------------------------------------------
>
> Key: CASSANDRA-11824
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11824
> Project: Cassandra
> Issue Type: Bug
> Reporter: T Jake Luciani
> Assignee: Marcus Eriksson
> Labels: fallout
> Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>
> I have a test that disables gossip and runs repair at the same time.
> {quote}
> WARN [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,775
> StorageService.java:384 - Stopping gossip by operator request
> INFO [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,775
> Gossiper.java:1463 - Announcing shutdown
> INFO [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,776
> StorageService.java:1999 - Node /172.31.31.1 state jump to shutdown
> INFO [HANDSHAKE-/172.31.17.32] 2016-05-17 16:57:21,895
> OutboundTcpConnection.java:514 - Handshaking version with /172.31.17.32
> INFO [HANDSHAKE-/172.31.24.76] 2016-05-17 16:57:21,895
> OutboundTcpConnection.java:514 - Handshaking version with /172.31.24.76
> INFO [Thread-25] 2016-05-17 16:57:21,925 RepairRunnable.java:125 - Starting
> repair command #1, repairing keyspace keyspace1 with repair options
> (parallelism: parallel, primary range: false, incremental: true, job threads:
> 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> INFO [Thread-26] 2016-05-17 16:57:21,953 RepairRunnable.java:125 - Starting
> repair command #2, repairing keyspace stresscql with repair options
> (parallelism: parallel, primary range: false, incremental: true, job threads:
> 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> INFO [Thread-27] 2016-05-17 16:57:21,967 RepairRunnable.java:125 - Starting
> repair command #3, repairing keyspace system_traces with repair options
> (parallelism: parallel, primary range: false, incremental: true, job threads:
> 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 2)
> {quote}
> This ends up failing:
> {quote}
> 16:54:44.844 INFO serverGroup-node-1-574 - STDOUT: [2016-05-17 16:57:21,933]
> Starting repair command #1, repairing keyspace keyspace1 with repair options
> (parallelism: parallel, primary range: false, incremental: true, job threads:
> 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> [2016-05-17 16:57:21,943] Did not get positive replies from all endpoints.
> List of failed endpoint(s): [172.31.24.76, 172.31.17.32]
> [2016-05-17 16:57:21,945] null
> {quote}
> Subsequent calls to repair with all nodes up still fail:
> {quote}
> ERROR [ValidationExecutor:3] 2016-05-17 18:58:53,460
> CompactionManager.java:1193 - Cannot start multiple repair sessions over the
> same sstables
> ERROR [ValidationExecutor:3] 2016-05-17 18:58:53,460 Validator.java:261 -
> Failed creating a merkle tree for [repair
> #66425f10-1c61-11e6-83b2-0b1fff7a067d on keyspace1/standard1,
> {quote}