[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15307710#comment-15307710 ]

Francesco Animali commented on CASSANDRA-11824:
-----------------------------------------------

I have two questions:
1. With this fix, will clients be able to run the incremental repair service from OpsCenter without the risk of hitting `Cannot start multiple repair sessions over the same sstables` errors?
2. Is there a DSE version that includes Cassandra 2.1.15 (DSE 4.8.8), or can this fix be installed by replacing the cassandra.jar?

Thank you :-)

> If repair fails no way to run repair again
> ------------------------------------------
>
>                 Key: CASSANDRA-11824
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11824
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: T Jake Luciani
>            Assignee: Marcus Eriksson
>              Labels: fallout
>             Fix For: 2.1.15, 2.2.7, 3.7, 3.0.7
>
>
> I have a test that disables gossip and runs repair at the same time.
> {quote}
> WARN [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,775
> StorageService.java:384 - Stopping gossip by operator request
> INFO [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,775
> Gossiper.java:1463 - Announcing shutdown
> INFO [RMI TCP Connection(15)-54.67.121.105] 2016-05-17 16:57:21,776
> StorageService.java:1999 - Node /172.31.31.1 state jump to shutdown
> INFO [HANDSHAKE-/172.31.17.32] 2016-05-17 16:57:21,895
> OutboundTcpConnection.java:514 - Handshaking version with /172.31.17.32
> INFO [HANDSHAKE-/172.31.24.76] 2016-05-17 16:57:21,895
> OutboundTcpConnection.java:514 - Handshaking version with /172.31.24.76
> INFO [Thread-25] 2016-05-17 16:57:21,925 RepairRunnable.java:125 - Starting
> repair command #1, repairing keyspace keyspace1 with repair options
> (parallelism: parallel, primary range: false, incremental: true, job threads:
> 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> INFO [Thread-26] 2016-05-17 16:57:21,953 RepairRunnable.java:125 - Starting
> repair command #2, repairing keyspace stresscql with repair options
> (parallelism: parallel, primary range: false, incremental: true, job threads:
> 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> INFO [Thread-27] 2016-05-17 16:57:21,967 RepairRunnable.java:125 - Starting
> repair command #3, repairing keyspace system_traces with repair options
> (parallelism: parallel, primary range: false, incremental: true, job threads:
> 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 2)
> {quote}
> This ends up failing:
> {quote}
> 16:54:44.844 INFO serverGroup-node-1-574 - STDOUT: [2016-05-17 16:57:21,933]
> Starting repair command #1, repairing keyspace keyspace1 with repair options
> (parallelism: parallel, primary range: false, incremental: true, job threads:
> 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> [2016-05-17 16:57:21,943] Did not get positive replies from all endpoints.
> List of failed endpoint(s): [172.31.24.76, 172.31.17.32]
> [2016-05-17 16:57:21,945] null
> {quote}
> Subsequent calls to repair with all nodes up still fail:
> {quote}
> ERROR [ValidationExecutor:3] 2016-05-17 18:58:53,460
> CompactionManager.java:1193 - Cannot start multiple repair sessions over the
> same sstables
> ERROR [ValidationExecutor:3] 2016-05-17 18:58:53,460 Validator.java:261 -
> Failed creating a merkle tree for [repair
> #66425f10-1c61-11e6-83b2-0b1fff7a067d on keyspace1/standard1,
> {quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
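
The failure mode in the issue description can be illustrated with a toy model. This is only a sketch, not Cassandra's code: `ParentRepairRegistry`, its methods, and the sstable names are hypothetical, and it assumes (going by the review comments on this ticket) that the fix drops a failed coordinator's parent repair sessions once the failure detector marks it down.

```python
class ParentRepairRegistry:
    """Toy model of a node's parent repair session bookkeeping (hypothetical)."""

    def __init__(self):
        # session id -> (coordinator address, set of sstable names under repair)
        self._sessions = {}

    def register(self, session_id, coordinator, sstables):
        """Record a new parent session; refuse if any sstable is already claimed."""
        wanted = set(sstables)
        for _, in_use in self._sessions.values():
            if wanted & in_use:
                raise RuntimeError(
                    "Cannot start multiple repair sessions over the same sstables")
        self._sessions[session_id] = (coordinator, wanted)

    def on_coordinator_failure(self, endpoint):
        """Drop every session whose coordinator was marked down."""
        stale = [sid for sid, (coord, _) in self._sessions.items()
                 if coord == endpoint]
        for sid in stale:
            del self._sessions[sid]
        return stale


registry = ParentRepairRegistry()
registry.register("repair-1", "172.31.31.1", ["standard1-1", "standard1-2"])

# Without cleanup, the aborted session still claims its sstables.
try:
    registry.register("repair-2", "172.31.31.1", ["standard1-1"])
    second_attempt = "started"
except RuntimeError as exc:
    second_attempt = str(exc)

# Once the failed coordinator's sessions are removed, repair can run again.
removed = registry.on_coordinator_failure("172.31.31.1")
registry.register("repair-3", "172.31.31.1", ["standard1-1"])
```

In this model the second attempt is rejected only because the first session was never cleaned up, which matches the symptom reported above.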
[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15299187#comment-15299187 ]

Paulo Motta commented on CASSANDRA-11824:
-----------------------------------------

tests look good now, +1

> If repair fails no way to run repair again
> ------------------------------------------
>
>                 Key: CASSANDRA-11824
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11824
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: T Jake Luciani
>            Assignee: Marcus Eriksson
>              Labels: fallout
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298563#comment-15298563 ]

Marcus Eriksson commented on CASSANDRA-11824:
---------------------------------------------

pushed a new commit to the dtest (in 2.1 we log "requesting merkle trees for", in 2.2+ "Requesting merkle trees for") and triggered new builds

> If repair fails no way to run repair again
> ------------------------------------------
>
>                 Key: CASSANDRA-11824
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11824
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: T Jake Luciani
>            Assignee: Marcus Eriksson
>              Labels: fallout
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
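
The capitalization difference between 2.1 and 2.2+ can be absorbed by a single pattern. A minimal sketch, assuming the log-watching helper treats its argument as a regular expression; `MERKLE_REQUEST` and `mentions_merkle_request` are names made up for this illustration and are not part of the dtest.

```python
import re

# Matches the message as logged by 2.1 ("requesting ...") and by 2.2+ ("Requesting ...").
MERKLE_REQUEST = re.compile(r"[Rr]equesting merkle trees for")


def mentions_merkle_request(line):
    """Return True if a log line contains either spelling of the message."""
    return MERKLE_REQUEST.search(line) is not None
```

A character class keeps the match strict about the rest of the message, where `re.IGNORECASE` would also accept spellings that neither version logs.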
[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298474#comment-15298474 ]

Marcus Eriksson commented on CASSANDRA-11824:
---------------------------------------------

seems the new test failed in the new build you triggered [~pauloricardomg], I'll check it tomorrow

> If repair fails no way to run repair again
> ------------------------------------------
>
>                 Key: CASSANDRA-11824
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11824
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: T Jake Luciani
>            Assignee: Marcus Eriksson
>              Labels: fallout
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296125#comment-15296125 ]

Marcus Eriksson commented on CASSANDRA-11824:
---------------------------------------------

pushed a dtest update to the repo above

> If repair fails no way to run repair again
> ------------------------------------------
>
>                 Key: CASSANDRA-11824
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11824
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: T Jake Luciani
>            Assignee: Marcus Eriksson
>              Labels: fallout
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294286#comment-15294286 ]

Paulo Motta commented on CASSANDRA-11824:
-----------------------------------------

bq. I wanted to avoid leaking {{this}} out of the constructor, at least in theory we could get gossip events to a non-fully constructed object

Ah, that's right, thanks for pointing this out. Perhaps we should register it in the static block, or replace the singleton field with a {{getInstance()}}? Anyway, this is just a style preference, don't bother too much about it. :)

> If repair fails no way to run repair again
> ------------------------------------------
>
>                 Key: CASSANDRA-11824
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11824
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: T Jake Luciani
>            Assignee: Marcus Eriksson
>              Labels: fallout
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294202#comment-15294202 ]

Marcus Eriksson commented on CASSANDRA-11824:
---------------------------------------------

bq. always registering with the FD/Gossiper at ActiveRepairService construction will make the code simpler

I wanted to avoid leaking {{this}} out of the constructor, at least in theory we could get gossip events to a non-fully constructed object

I will improve the dtest with your suggestions

> If repair fails no way to run repair again
> ------------------------------------------
>
>                 Key: CASSANDRA-11824
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11824
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: T Jake Luciani
>            Assignee: Marcus Eriksson
>              Labels: fallout
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
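
The concern about leaking {{this}} out of a constructor can be shown in miniature. This is a Python analogy rather than the Java code under review: `Gossiper`, `LeakyRepairService` and `SafeRepairService` are hypothetical stand-ins, and `Gossiper.register` fires an event immediately only to simulate the worst-case timing. The `get_instance()` factory mirrors the {{getInstance()}} idea raised in this thread: build the object first, publish it afterwards.

```python
class Gossiper:
    """Hypothetical stand-in for a gossip / failure-detector event source."""

    def __init__(self):
        self.listeners = []

    def register(self, listener):
        self.listeners.append(listener)
        # Simulate an event arriving the moment the listener becomes visible.
        listener.on_down("10.0.0.1")


class LeakyRepairService:
    def __init__(self, gossiper):
        gossiper.register(self)   # the object escapes before it is fully set up
        self.sessions = {}

    def on_down(self, endpoint):
        self.sessions.pop(endpoint, None)


class SafeRepairService:
    _instance = None

    def __init__(self):
        self.sessions = {}

    def on_down(self, endpoint):
        self.sessions.pop(endpoint, None)

    @classmethod
    def get_instance(cls, gossiper):
        if cls._instance is None:
            instance = cls()              # fully constructed first
            gossiper.register(instance)   # published only afterwards
            cls._instance = instance
        return cls._instance


try:
    LeakyRepairService(Gossiper())
    leaky_result = "ok"
except AttributeError:
    leaky_result = "callback saw a half-built object"

gossiper = Gossiper()
service = SafeRepairService.get_instance(gossiper)
```

In Java the half-built state would show up as fields still at their default values rather than as an exception, but the ordering problem is the same.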
[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15294182#comment-15294182 ]

Paulo Motta commented on CASSANDRA-11824:
-----------------------------------------

Overall +1, just some minor nits:
* I think always registering with the FD/Gossiper at {{ActiveRepairService}} construction will make the code simpler than registering only on the first submitted session and keeping the {{registeredForEndpointChanges}} variable. The penalty will be negligible when no repair is running, and {{ActiveRepairService}} keeps less state.
* To make the dtest more deterministic, can you {{watch_log_for("Requesting merkle trees for")}} instead of sleeping for {{3 seconds}}? We could maybe also check for {{"Removing .* in parent repair sessions"}} in the logs of the participant nodes, to make sure the FD is killing the repair session. On my box, for example, the coordinator was killed before any message was sent to the participants.

> If repair fails no way to run repair again
> ------------------------------------------
>
>                 Key: CASSANDRA-11824
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11824
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: T Jake Luciani
>            Assignee: Marcus Eriksson
>              Labels: fallout
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
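
The second nit, waiting for a log line instead of sleeping for a fixed time, can be sketched as a small polling helper. This is an illustration only: `wait_for_log` is a hypothetical function written for this example, not the dtest framework's {{watch_log_for}}, and it rescans the whole file on every poll for simplicity.

```python
import os
import re
import tempfile
import time


def wait_for_log(path, pattern, timeout=10.0, poll_interval=0.1):
    """Poll a log file until a line matches `pattern`.

    Returns the matching line, or raises TimeoutError once `timeout`
    seconds have passed without a match.
    """
    regex = re.compile(pattern)
    deadline = time.monotonic() + timeout
    while True:
        with open(path, "r") as log:
            for line in log:
                if regex.search(line):
                    return line.rstrip("\n")
        if time.monotonic() >= deadline:
            raise TimeoutError("no line matching %r in %s" % (pattern, path))
        time.sleep(poll_interval)


# Usage with a throwaway log file standing in for a node's system.log.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("INFO Starting repair command #1\n")
    tmp.write("INFO Requesting merkle trees for standard1\n")
    log_path = tmp.name

matched = wait_for_log(log_path, r"[Rr]equesting merkle trees for", timeout=1.0)
os.remove(log_path)
```

The test then proceeds as soon as the event has actually happened, and fails with a clear timeout if it never does, instead of depending on how fast the machine is.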
[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15292779#comment-15292779 ]

Marcus Eriksson commented on CASSANDRA-11824:
---------------------------------------------

fallout test passed as well

> If repair fails no way to run repair again
> ------------------------------------------
>
>                 Key: CASSANDRA-11824
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11824
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: T Jake Luciani
>            Assignee: Marcus Eriksson
>              Labels: fallout
>             Fix For: 3.0.x
[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291124#comment-15291124 ]

Marcus Eriksson commented on CASSANDRA-11824:
---------------------------------------------

pushed and new builds triggered

> If repair fails no way to run repair again
> ------------------------------------------
>
>                 Key: CASSANDRA-11824
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11824
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: T Jake Luciani
>            Assignee: Marcus Eriksson
>              Labels: fallout
>             Fix For: 3.0.x
[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291089#comment-15291089 ]

Marcus Eriksson commented on CASSANDRA-11824:
---------------------------------------------

[~pauloricardomg] yeah good point, we can do that in 2.2+. In 2.1 it is still valid to not have a PRS (ParentRepairSession). Will update the patches.
[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291052#comment-15291052 ]

Marcus Eriksson commented on CASSANDRA-11824:
---------------------------------------------

The problem occurs when the repair coordinator dies: the repairing nodes then never clear out their ParentRepairSessions.

My approach is to have ActiveRepairService start listening for endpoint changes and failure detector events. For example:
* in a cluster with A, B, C, we trigger repair against A.
* during repair, A dies.
* B and C get notified about this and mark the ParentRepairSession as failed.

It gets a bit tricky because node A might not have realized that it was down and may just continue with its repair. So we keep a 'failed' version of the parent repair session around for 24h on B and C; if anyone tries to get it (say node A continues sending validation requests), we throw an exception, which fails the repair on node A as well.

A dtest to reproduce the error: https://github.com/krummas/cassandra-dtest/commits/marcuse/11824

||branch||testall||dtest||
|[marcuse/11824|https://github.com/krummas/cassandra/tree/marcuse/11824]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-dtest]|
|[marcuse/11824-2.2|https://github.com/krummas/cassandra/tree/marcuse/11824-2.2]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-2.2-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-2.2-dtest]|
|[marcuse/11824-3.0|https://github.com/krummas/cassandra/tree/marcuse/11824-3.0]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-3.0-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-3.0-dtest]|
|[marcuse/11824-3.7|https://github.com/krummas/cassandra/tree/marcuse/11824-3.7]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-3.7-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-3.7-dtest]|
|[marcuse/11824-trunk|https://github.com/krummas/cassandra/tree/marcuse/11824-trunk]|[testall|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-trunk-testall]|[dtest|http://cassci.datastax.com/view/Dev/view/krummas/job/krummas-marcuse-11824-trunk-dtest]|

I should also note that this does not seem to fix CASSANDRA-11728.

Could you review, [~yukim]?
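The approach Marcus Eriksson describes in the comment above (mark a parent repair session as failed when its coordinator goes down, remember the failure for 24h, and throw when anyone asks for that session) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual patch: `ParentSessionTracker`, `register`, `onEndpointDown`, and `getCoordinator` are hypothetical names, endpoints are plain strings, and time is passed in explicitly so the expiry logic is easy to follow.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; this is NOT Cassandra's ActiveRepairService. */
class ParentSessionTracker {
    // How long a failed session is remembered (24h, per the comment above).
    static final long FAILED_TTL_MILLIS = TimeUnit.HOURS.toMillis(24);

    // Active parent sessions: session id -> coordinator endpoint.
    private final Map<UUID, String> active = new HashMap<>();
    // Failed parent sessions: session id -> time the failure was recorded.
    private final Map<UUID, Long> failed = new HashMap<>();

    synchronized void register(UUID sessionId, String coordinator) {
        active.put(sessionId, coordinator);
    }

    /** Assumed to be called when an endpoint is reported as down. */
    synchronized void onEndpointDown(String endpoint, long nowMillis) {
        Iterator<Map.Entry<UUID, String>> it = active.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<UUID, String> entry = it.next();
            if (entry.getValue().equals(endpoint)) {
                // The coordinator is gone: keep a 'failed' marker instead of the session.
                failed.put(entry.getKey(), nowMillis);
                it.remove();
            }
        }
    }

    /**
     * Returns the coordinator of an active session, or null if the session is unknown.
     * Throws if the session was marked failed less than 24h ago, so a coordinator
     * that kept going after being declared down has its requests rejected.
     */
    synchronized String getCoordinator(UUID sessionId, long nowMillis) {
        Long failedAt = failed.get(sessionId);
        if (failedAt != null) {
            if (nowMillis - failedAt < FAILED_TTL_MILLIS) {
                throw new IllegalStateException("Parent repair session " + sessionId + " has failed");
            }
            failed.remove(sessionId); // the failure marker has expired
        }
        return active.get(sessionId);
    }

    synchronized boolean isActive(UUID sessionId) {
        return active.containsKey(sessionId);
    }
}
```

In the A, B, C example, B and C would each call `onEndpointDown("A", now)` once they learn that A is down. A later validation request from A for the same session would then hit the exception in `getCoordinator`, which is what lets the repair fail on A as well.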
[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289074#comment-15289074 ]

Marcus Eriksson commented on CASSANDRA-11824:
---------------------------------------------

yeah we are testing a patch, problem is that we don't clean up the parent repair session when the repair coordinator dies
[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289068#comment-15289068 ]

Nick Bailey commented on CASSANDRA-11824:
-----------------------------------------

Hmm. Yeah this could be the cause of CASSANDRA-11728. I remember seeing dropped message warnings in the logs during that test which could be a similar situation to turning off gossip.
[jira] [Commented] (CASSANDRA-11824) If repair fails no way to run repair again
[ https://issues.apache.org/jira/browse/CASSANDRA-11824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288626#comment-15288626 ]

Francesco Animali commented on CASSANDRA-11824:
-----------------------------------------------

hi [~tjake], this scenario you reproduced doesn't seem too far off from what [~nickmbailey] said here: https://datastax.jira.com/browse/OPSC-8202?focusedCommentId=146147&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-146147