[ 
https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-11824:
------------------------------------
    Reviewer: Paulo Motta

I can review it, as it's quite fresh in my mind and somewhat related to 
CASSANDRA-3486/CASSANDRA-11190, where we will cancel ongoing tasks when the 
repair session fails.

bq. It gets a bit tricky as node A might not have realized that it was down and 
just continues with its repair, so we keep a 'failed' version of the parent 
repair session around for 24h on B and C, so if anyone tries to get that (say 
node A continues sending validation requests for example) we throw an exception 
which will fail the repair on node A as well

Now that we always register the parent repair session on {{PREPARE_MESSAGE}}, 
how about just sending an error response to the sender if we get any request 
from an unknown parent repair session id? That way we would not need to keep 
failed repair sessions around.
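
To make the suggestion concrete, here is a rough sketch of what I have in mind. It is not the actual Cassandra code: the class, enum and method names ({{UnknownSessionSketch}}, {{Reply}}, {{onPrepare}}, {{onSessionFailed}}, {{onRepairRequest}}) are made up for illustration, and real message handling would go through the existing repair verb handlers.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only: these names are assumptions, not Cassandra APIs.
class UnknownSessionSketch {
    // Possible replies a participant sends back to the repair coordinator.
    enum Reply { OK, UNKNOWN_PARENT_SESSION }

    // Parent repair sessions this node currently knows about.
    private final Set<UUID> activeParentSessions = ConcurrentHashMap.newKeySet();

    // Called when a prepare message arrives: the session is always registered here.
    void onPrepare(UUID parentSessionId) {
        activeParentSessions.add(parentSessionId);
    }

    // Called when the session fails or finishes: simply forget it,
    // instead of keeping a 'failed' copy around for 24h.
    void onSessionFailed(UUID parentSessionId) {
        activeParentSessions.remove(parentSessionId);
    }

    // Called for any later request (for example a validation request).
    // An id we do not know yields an error reply, which the sender
    // can use to fail its own side of the repair.
    Reply onRepairRequest(UUID parentSessionId) {
        return activeParentSessions.contains(parentSessionId)
                ? Reply.OK
                : Reply.UNKNOWN_PARENT_SESSION;
    }
}
```

With something like this, node A continuing to send validation requests after B and C dropped the session would get {{UNKNOWN_PARENT_SESSION}} back and could fail its repair, without B and C tracking failed sessions separately.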

> If repair fails no way to run repair again
> ------------------------------------------
>
>                 Key: CASSANDRA-11824
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11824
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: T Jake Luciani
>            Assignee: Marcus Eriksson
>              Labels: fallout
>             Fix For: 3.0.x
>
>
> I have a test that disables gossip and runs repair at the same time. 
> {quote}
> WARN  [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,775 
> StorageService.java:384 - Stopping gossip by operator request
> INFO  [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,775 
> Gossiper.java:1463 - Announcing shutdown
> INFO  [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,776 
> StorageService.java:1999 - Node /172.31.31.1 state jump to shutdown
> INFO  [HANDSHAKE-/172.31.17.32] 2016-05-17 16:57:21,895 
> OutboundTcpConnection.java:514 - Handshaking version with /172.31.17.32
> INFO  [HANDSHAKE-/172.31.24.76] 2016-05-17 16:57:21,895 
> OutboundTcpConnection.java:514 - Handshaking version with /172.31.24.76
> INFO  [Thread-25] 2016-05-17 16:57:21,925 RepairRunnable.java:125 - Starting 
> repair command #1, repairing keyspace keyspace1 with repair options 
> (parallelism: parallel, primary range: false, incremental: true, job threads: 
> 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> INFO  [Thread-26] 2016-05-17 16:57:21,953 RepairRunnable.java:125 - Starting 
> repair command #2, repairing keyspace stresscql with repair options 
> (parallelism: parallel, primary range: false, incremental: true, job threads: 
> 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> INFO  [Thread-27] 2016-05-17 16:57:21,967 RepairRunnable.java:125 - Starting 
> repair command #3, repairing keyspace system_traces with repair options 
> (parallelism: parallel, primary range: false, incremental: true, job threads: 
> 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 2)
> {quote}
> This ends up failing:
> {quote}
> 16:54:44.844 INFO  serverGroup-node-1-574 - STDOUT: [2016-05-17 16:57:21,933] 
> Starting repair command #1, repairing keyspace keyspace1 with repair options 
> (parallelism: parallel, primary range: false, incremental: true, job threads: 
> 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> [2016-05-17 16:57:21,943] Did not get positive replies from all endpoints. 
> List of failed endpoint(s): [172.31.24.76, 172.31.17.32]
> [2016-05-17 16:57:21,945] null
> {quote}
> Subsequent calls to repair with all nodes up still fail:
> {quote}
> ERROR [ValidationExecutor:3] 2016-05-17 18:58:53,460 
> CompactionManager.java:1193 - Cannot start multiple repair sessions over the 
> same sstables
> ERROR [ValidationExecutor:3] 2016-05-17 18:58:53,460 Validator.java:261 - 
> Failed creating a merkle tree for [repair 
> #66425f10-1c61-11e6-83b2-0b1fff7a067d on keyspace1/standard1, 
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
