[jira] [Comment Edited] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268355#comment-14268355 ]

Alan Boudreault edited comment on CASSANDRA-8316 at 1/7/15 10:09 PM:
---------------------------------------------------------------------

I did more tests during the afternoon, and it looks like I can get the cluster back into a correct state after hitting the issues, either by using forceTerminateAllRepairSessions, restarting the cluster, or running more repairs. Failed repairs seem to get fixed at a certain point. I think we are good with that patch then!

BTW, I've attached a v3 patch, which simply adds an import line so that the code compiles.

was (Author: aboudreault):
I did more tests during the afternoon, and it looks like I can get the cluster back into a correct state after hitting the issues, either by using forceTerminateAllRepairSessions, restarting the cluster, or running more repairs. This seems to get fixed at a certain point. I think we are good with that patch then!

BTW, I've attached a v3 patch, which simply adds an import line so that the code compiles.

Did not get positive replies from all endpoints error on incremental repair
---------------------------------------------------------------------------

Key: CASSANDRA-8316
URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
Fix For: 2.1.3
Attachments: 0001-patch.patch, 8316-v2.patch, 8316-v3.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh

Hi,

I've got an issue with incremental repairs on our production 15-node 2.1.2 cluster (new cluster, not yet loaded, RF=3).

After having successfully performed an incremental repair (-par -inc) on 3 nodes, I started receiving "Repair failed with error Did not get positive replies from all endpoints." from nodetool on all remaining nodes:

[2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges for keyspace (seq=false, full=false)
[2014-11-14 09:12:47,919] Repair failed with error Did not get positive replies from all endpoints.

All the nodes are up and running, and the local system log shows that the repair commands got started and that's it.

I've also noticed that soon after the repair, several nodes started having more CPU load indefinitely without any particular reason (no tasks / queries, nothing in the logs). I then restarted C* on these nodes and retried the repair on several nodes, which was successful until I hit the issue again.

I tried to repro on our 3-node preproduction cluster without success.

It looks like I'm not the only one having this issue: http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html

Any idea?

Thanks
Loic

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251628#comment-14251628 ]

Marcus Eriksson edited comment on CASSANDRA-8316 at 12/18/14 1:13 PM:
----------------------------------------------------------------------

To summarize this:
* we had a bug in compaction marking that could make a node end up in an infinite loop, fixed in the branch linked above
* we allowed multiple repairs over the same sstables, fixed
* we had a situation where we didn't remove the parent repair sessions, fixed

And, to describe the final problem:
# Node A sends a PrepareMessage to overloaded Node B
# B starts preparing
# A times out waiting for B to prepare
# B finishes preparing and marks a bunch of sstables as being repaired
# User retries the repair on node A
# B gets the new PrepareMessage but sees that the sstables it wants to repair are already being repaired, and refuses to start

One solution could be to have A send out a cancel message. Another solution could be to have B remove any parent repair sessions after 5 (or something) minutes if it hasn't received a validation message before that. Need [~yukim] input.

was (Author: krummas):
To summarize this:
1. we had a bug in compaction marking that could make a node end up in an infinite loop, fixed in the branch linked above
2. we allowed multiple repairs over the same sstables, fixed
3. we had a situation where we didn't remove the parent repair sessions, fixed

And, to describe the final problem:
# Node A sends a PrepareMessage to overloaded Node B
# B starts preparing
# A times out waiting for B to prepare
# B finishes preparing and marks a bunch of sstables as being repaired
# User retries the repair on node A
# B gets the new PrepareMessage but sees that the sstables it wants to repair are already being repaired, and refuses to start

One solution could be to have A send out a cancel message. Another solution could be to have B remove any parent repair sessions after 5 (or something) minutes if it hasn't received a validation message before that. Need [~yukim] input.

Did not get positive replies from all endpoints error on incremental repair
---------------------------------------------------------------------------

Key: CASSANDRA-8316
URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
Fix For: 2.1.3
Attachments: 0001-patch.patch, 8316-v2.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh
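The second solution proposed in the 12/18/14 comment (node B dropping a parent repair session that never receives a validation message) can be sketched roughly as follows. This is only an illustration of the idea, not the patch attached to the ticket: the class `ParentSessionRegistry`, its methods, and the string sstable names are all hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Hypothetical model of node B's bookkeeping; not Cassandra's ActiveRepairService. */
class ParentSessionRegistry {
    // Sessions that were prepared but have not yet seen a validation message.
    private final Map<UUID, Long> awaitingValidation = new HashMap<>();
    // Which parent session currently claims each sstable.
    private final Map<String, UUID> sstableOwner = new HashMap<>();
    private final long expiryMillis;

    ParentSessionRegistry(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    /** Handles a prepare request; refuses if any sstable is already being repaired. */
    synchronized boolean prepare(UUID session, List<String> sstables, long nowMillis) {
        expireStale(nowMillis);
        for (String sstable : sstables) {
            if (sstableOwner.containsKey(sstable)) {
                return false; // "Already repairing ..., can not continue."
            }
        }
        for (String sstable : sstables) {
            sstableOwner.put(sstable, session);
        }
        awaitingValidation.put(session, nowMillis);
        return true;
    }

    /** A validation message arrived, so the coordinator is still driving this session. */
    synchronized void onValidationRequest(UUID session) {
        awaitingValidation.remove(session);
    }

    /** The repair finished (or was cancelled); release everything it claimed. */
    synchronized void finish(UUID session) {
        awaitingValidation.remove(session);
        sstableOwner.values().removeIf(session::equals);
    }

    /** Drops sessions that were prepared but never validated within the expiry window. */
    synchronized void expireStale(long nowMillis) {
        Iterator<Map.Entry<UUID, Long>> it = awaitingValidation.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<UUID, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= expiryMillis) {
                UUID stale = entry.getKey();
                sstableOwner.values().removeIf(stale::equals);
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        ParentSessionRegistry nodeB = new ParentSessionRegistry(5 * 60_000L);
        List<String> sstables = Arrays.asList("r1-Standard1-ka-38");
        UUID timedOut = UUID.randomUUID();
        UUID retry = UUID.randomUUID();
        System.out.println(nodeB.prepare(timedOut, sstables, 0L));        // first prepare is accepted
        System.out.println(nodeB.prepare(retry, sstables, 60_000L));      // retry after 1 minute is refused
        System.out.println(nodeB.prepare(retry, sstables, 5 * 60_000L));  // retry after expiry is accepted
    }
}
```

The first solution (A sending a cancel message) would map onto calling something like `finish` on B as soon as A gives up, which avoids waiting for the expiry window but depends on the cancel message actually reaching the overloaded node.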
[jira] [Comment Edited] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249690#comment-14249690 ]

Marcus Eriksson edited comment on CASSANDRA-8316 at 12/17/14 10:27 AM:
-----------------------------------------------------------------------

I think we are simply timing out the Prepare message when TRACE is enabled (I can't even start an 8-node cluster with TRACE on).

One solution could be to increase the timeout for this message, but we use the same timeout for snapshot creation, and that would be just as likely to fail on a heavily loaded cluster. wdyt [~yukim]?

Also, note that in your test you repair all ranges, meaning that when you repair node5, for example, you actually include node3,4,5,6,7, so you can't repair any of those at the same time.

was (Author: krummas):
I think we are simply timing out the Prepare message when TRACE is enabled (I can't even start an 8-node cluster with TRACE on).

One solution could be to increase the timeout, but we use the same timeout for snapshot creation, and that would be just as likely to fail on a heavily loaded cluster. wdyt [~yukim]?

Also, note that in your test you repair all ranges, meaning that when you repair node5, for example, you actually include node3,4,5,6,7, so you can't repair any of those at the same time.

Did not get positive replies from all endpoints error on incremental repair
---------------------------------------------------------------------------

Key: CASSANDRA-8316
URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
Fix For: 2.1.3
Attachments: 0001-patch.patch, 8316-v2.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh
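The remark that repairing all ranges on node5 also involves node3,4,5,6,7 follows from RF=3. A minimal sketch of that arithmetic is below, under assumptions that are not stated in the ticket: one token per node, nodes numbered 1..N in ring order, and each range stored on its owner plus the next RF-1 nodes clockwise. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a ring of single-token nodes with simple clockwise replica placement. */
class RepairNeighbours {

    /**
     * Returns the nodes touched when every range stored on `node` is repaired.
     * Nodes are numbered 1..clusterSize around the ring.
     */
    static List<Integer> involvedNodes(int node, int clusterSize, int rf) {
        boolean[] involved = new boolean[clusterSize + 1];
        // `node` stores the ranges owned by itself and by the (rf - 1) nodes before it.
        for (int back = 0; back < rf; back++) {
            int owner = wrap(node - back, clusterSize);
            // Each of those ranges is replicated on its owner and the next (rf - 1) nodes.
            for (int fwd = 0; fwd < rf; fwd++) {
                involved[wrap(owner + fwd, clusterSize)] = true;
            }
        }
        List<Integer> result = new ArrayList<>();
        for (int n = 1; n <= clusterSize; n++) {
            if (involved[n]) {
                result.add(n);
            }
        }
        return result;
    }

    /** Maps any integer onto the 1..clusterSize ring positions. */
    private static int wrap(int node, int clusterSize) {
        return Math.floorMod(node - 1, clusterSize) + 1;
    }

    public static void main(String[] args) {
        // 8-node cluster, RF=3, repairing node5
        System.out.println(involvedNodes(5, 8, 3)); // prints [3, 4, 5, 6, 7]
    }
}
```

With these assumptions, any two nodes whose involved sets overlap cannot be repaired at the same time, which is why starting repairs on neighbouring nodes in parallel collides.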
[jira] [Comment Edited] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246153#comment-14246153 ]

Alan Boudreault edited comment on CASSANDRA-8316 at 12/14/14 9:59 PM:
----------------------------------------------------------------------

[~krummas], unfortunately, I'm still getting the initial issue AND the issue introduced on multiple nodes with the v2 patch + CASSANDRA-8458. 8-node cluster and cassandra-stress with n=100.

{code}
aboudreault@kovarro:~/dev/cstar/8316$ ccm node3 nodetool -- repair -par -inc
[2014-12-14 16:30:48,037] Starting repair command #1, repairing 3 ranges for keyspace r1 (parallelism=PARALLEL, full=false)
[2014-12-14 16:30:48,040] Repair failed with error Already repairing SSTableReader(path='/home/aboudreault/.ccm/local/node3/data/r1/Standard1-d11383e083d411e4869ab56034537865/r1-Standard1-ka-38-Data.db'), can not continue.
{code}

In addition, I noticed that it took 20 minutes to restart my cluster this time. Not sure if it's related to this issue, but I've attached a yourkit snapshot. Let me know if I can do anything else.

was (Author: aboudreault):
[~krummas], unfortunately, I'm still getting the same issue introduced on multiple nodes with the v2 patch + CASSANDRA-8458. 8-node cluster and cassandra-stress with n=100.

{code}
aboudreault@kovarro:~/dev/cstar/8316$ ccm node3 nodetool -- repair -par -inc
[2014-12-14 16:30:48,037] Starting repair command #1, repairing 3 ranges for keyspace r1 (parallelism=PARALLEL, full=false)
[2014-12-14 16:30:48,040] Repair failed with error Already repairing SSTableReader(path='/home/aboudreault/.ccm/local/node3/data/r1/Standard1-d11383e083d411e4869ab56034537865/r1-Standard1-ka-38-Data.db'), can not continue.
{code}

In addition, I noticed that it took 20 minutes to restart my cluster this time. Not sure if it's related to this issue, but I've attached a yourkit snapshot. Let me know if I can do anything else.

Did not get positive replies from all endpoints error on incremental repair
---------------------------------------------------------------------------

Key: CASSANDRA-8316
URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
Fix For: 2.1.3
Attachments: 0001-patch.patch, 8316-v2.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh
[jira] [Comment Edited] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241153#comment-14241153 ]

Marcus Eriksson edited comment on CASSANDRA-8316 at 12/10/14 2:42 PM:
----------------------------------------------------------------------

yep, will have a look, seems that we don't clear out the parent repair session on this failure mode

was (Author: krummas):
yep, will have a look, seems that we don't clear out the repair session on this failure mode

Did not get positive replies from all endpoints error on incremental repair
---------------------------------------------------------------------------

Key: CASSANDRA-8316
URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
Fix For: 2.1.3
Attachments: 0001-patch.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh
[jira] [Comment Edited] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224896#comment-14224896 ]

Alan Boudreault edited comment on CASSANDRA-8316 at 11/25/14 5:43 PM:
----------------------------------------------------------------------

[~krummas] [~llambiel] I have been able to reproduce this issue many times. However, I'm not sure exactly where the problem is. I've attached a small bash script that I used to test:

* I run a cluster of 8 nodes
* I set the cluster log level to TRACE (very important to reproduce the issue, we have to slow things down)
* I stress with n=50 with RF=3
* I start 3 nodetool repair -par -inc in parallel (Important: starting only 2 repairs doesn't produce the issue every time. It depends on the system load, I guess)
* The issue is related to incremental repair. I can't reproduce the issue with seq repair and/or par non-incremental repair.
* The compaction strategy is not related. I can reproduce the issue with LCS and STCS.

Basically, the issue happens when the system is very busy and doesn't respond fast enough. In the function sendRR of MessagingService.java, a callback is added with a timeout of 10 seconds. The endpoint doesn't respond within 10 seconds, so we get the error. However, even if we increase that timeout to 100 seconds, for example, the system doesn't get better and the load is still very high. We just get the error message "Lost notification, check server log for repair state of keyspace ..." instead of "Repair failed with error Did not get positive replies from all endpoints.".

When the load is high (even after the repair), I checked quickly with yourkit, and what is taking a lot of CPU time is the AntiEntropyStage thread, so the ActiveRepairService never ends? Let me know if I should go deeper in the profiling; perhaps I could get better profiling by enabling a cassandra agent + yourkit.

was (Author: aboudreault):
[~krummas] [~llambiel] I have been able to reproduce this issue many times. However, I'm not sure exactly where the problem is. I've attached a small bash script that I used to test:

* I run a cluster of 8 nodes
* I set the cluster log level to TRACE (very important to reproduce the issue, we have to slow things down)
* I stress with n=50 with RF=3 and
* I start 3 nodetool repair -par -inc in parallel (Important: starting only 2 repairs doesn't produce the issue every time. It depends on the system load, I guess)
* The issue is related to incremental repair. I can't reproduce the issue with seq repair and/or par non-incremental repair.
* The compaction strategy is not related. I can reproduce the issue with LCS and STCS.

Basically, the issue happens when the system is very busy and doesn't respond fast enough. In the function sendRR of MessagingService.java, a callback is added with a timeout of 10 seconds. The endpoint doesn't respond within 10 seconds, so we get the error. However, even if we increase that timeout to 100 seconds, for example, the system doesn't get better and the load is still very high. We just get the error message "Lost notification, check server log for repair state of keyspace ..." instead of "Repair failed with error Did not get positive replies from all endpoints.".

When the load is high (even after the repair), I checked quickly with yourkit, and what is taking a lot of CPU time is the AntiEntropyStage thread, so the ActiveRepairService never ends? Let me know if I should go deeper in the profiling; perhaps I could get better profiling by enabling a cassandra agent + yourkit.

Did not get positive replies from all endpoints error on incremental repair
---------------------------------------------------------------------------

Key: CASSANDRA-8316
URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Alan Boudreault
Attachments: test.sh
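The timeout behaviour described in the 11/25/14 comment (a callback that must fire within a fixed window, otherwise the repair is reported as failed) can be modelled with a small, self-contained sketch. It does not use Cassandra's MessagingService or its sendRR signature; the class `PrepareTimeoutSketch`, the endpoint names, and the delays are hypothetical, and slow endpoints are simulated with a scheduled executor.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of a coordinator waiting for prepare acknowledgements. */
class PrepareTimeoutSketch {

    /**
     * Simulates sending a prepare request to every endpoint and waiting for the replies.
     * replyDelayMillis maps an endpoint name to how long that endpoint takes to answer.
     * Returns true only if all endpoints answered within timeoutMillis.
     */
    static boolean prepareAll(Map<String, Long> replyDelayMillis, long timeoutMillis) {
        ScheduledExecutorService endpoints =
                Executors.newScheduledThreadPool(Math.max(1, replyDelayMillis.size()));
        CountDownLatch acks = new CountDownLatch(replyDelayMillis.size());
        Runnable ack = acks::countDown;
        try {
            for (long delay : replyDelayMillis.values()) {
                // Each simulated endpoint acknowledges after its own delay.
                endpoints.schedule(ack, delay, TimeUnit.MILLISECONDS);
            }
            // false means at least one acknowledgement is still missing at the deadline.
            return acks.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            endpoints.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Map<String, Long> cluster = new LinkedHashMap<>();
        cluster.put("node1", 10L);
        cluster.put("node2", 1_000L); // overloaded endpoint, slower than the timeout
        if (!prepareAll(cluster, 200L)) {
            System.out.println("Repair failed with error Did not get positive replies from all endpoints.");
        }
    }
}
```

In this model the simulated endpoint is simply cancelled when the coordinator gives up. On a real cluster the slow node would keep working after the coordinator has reported the failure, so raising the timeout only changes which error surfaces and does nothing to reduce the load.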