
[jira] [Comment Edited] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2015-01-07 Thread Alan Boudreault (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268355#comment-14268355
]

Alan Boudreault edited comment on CASSANDRA-8316 at 1/7/15 10:09 PM:
-

I did more tests during the afternoon, and it looks like I can get the cluster
back into a correct state after hitting the issues, either by using
forceTerminateAllRepairSessions, restarting the cluster, or running more
repairs. Failed repairs seem to get fixed at a certain point.

I think we are good with that patch then! BTW, I've attached a v3 patch, which
simply adds an import line so that it compiles.

was (Author: aboudreault):
I did more tests during the afternoon, and it looks like I can get the cluster
back into a correct state after hitting the issues, either by using
forceTerminateAllRepairSessions, restarting the cluster, or running more
repairs. This seems to get fixed at a certain point.

I think we are good with that patch then! BTW, I've attached a v3 patch, which
simply adds an import line so that it compiles.

Did not get positive replies from all endpoints error on incremental repair
--

Key: CASSANDRA-8316
URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
Project: Cassandra
Issue Type: Bug
Components: Core
Environment: cassandra 2.1.2
Reporter: Loic Lambiel
Assignee: Marcus Eriksson
Fix For: 2.1.3

Attachments: 0001-patch.patch, 8316-v2.patch, 8316-v3.patch,
CassandraDaemon-2014-11-25-2.snapshot.tar.gz,
CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh

Hi,
I've got an issue with incremental repairs on our 15-node production cluster
running 2.1.2 (new cluster, not yet loaded, RF=3).
After successfully performing an incremental repair (-par -inc) on 3
nodes, I started receiving "Repair failed with error Did not get positive
replies from all endpoints." from nodetool on all remaining nodes:
[2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges
for keyspace (seq=false, full=false)
[2014-11-14 09:12:47,919] Repair failed with error Did not get positive
replies from all endpoints.
All the nodes are up and running, and the local system log shows only that the
repair commands got started.
I've also noticed that soon after the repair, several nodes started showing
higher CPU load indefinitely for no apparent reason (no tasks / queries,
nothing in the logs). I then restarted C* on these nodes and retried the
repair on several nodes; these repairs succeeded until I hit the issue again.
I tried to reproduce it on our 3-node preproduction cluster without success.
It looks like I'm not the only one having this issue:
http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
Any idea?
Thanks
Loic

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-18 Thread Marcus Eriksson (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251628#comment-14251628
 ] 

Marcus Eriksson edited comment on CASSANDRA-8316 at 12/18/14 1:13 PM:
--

To summarize:
* we had a bug in compaction marking that could make a node end up in an 
infinite loop, fixed in branch linked above
* we allowed multiple repairs over the same sstables, fixed
* we had a situation where we didn't remove the parent repair sessions, fixed

And, to describe the final problem:

# Node A sends a PrepareMessage to overloaded Node B
# B starts preparing
# A times out waiting for B to prepare
# B finishes preparing and marks a bunch of sstables as being repaired
# User retries the repair on node A
# B gets the new PrepareMessage but sees that the sstables it wants to repair 
are already being repaired, and refuses to start

One solution could be to have A send out a cancel message; another could be 
to have B remove any parent repair sessions after 5 minutes (or so) if it 
hasn't received a validation message by then. Need [~yukim]'s input.
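
The final problem and the two proposed fixes can be illustrated with a toy model of node B. This is only a sketch under stated assumptions, not Cassandra's code: the class RepairParticipant, its method names, and the 300-second default are hypothetical and chosen for the example.

```python
class RepairParticipant:
    """Toy model of node B: tracks which sstables each parent session has claimed."""

    def __init__(self, sstables, session_ttl_seconds=300):
        self.sstables = set(sstables)
        self.session_ttl = session_ttl_seconds  # hypothetical "5 (or something) minutes"
        # parent session id -> (claimed sstables, time of prepare, validation seen?)
        self.parent_sessions = {}

    def _claimed(self):
        claimed = set()
        for tables, _, _ in self.parent_sessions.values():
            claimed |= tables
        return claimed

    def expire_stale_sessions(self, now):
        """Drop sessions that saw no validation request within the TTL."""
        stale = [sid for sid, (_, started, validated) in self.parent_sessions.items()
                 if not validated and now - started >= self.session_ttl]
        for sid in stale:
            del self.parent_sessions[sid]
        return stale

    def cancel(self, session_id):
        """Alternative fix: the coordinator explicitly cancels after timing out."""
        self.parent_sessions.pop(session_id, None)

    def prepare(self, session_id, now):
        """Handle a prepare request; refuse if a wanted sstable is already claimed."""
        self.expire_stale_sessions(now)
        if self.sstables & self._claimed():
            return False
        self.parent_sessions[session_id] = (set(self.sstables), now, False)
        return True


b = RepairParticipant({"ka-38", "ka-39"})
b.prepare("first", now=0)            # B finishes preparing after A has given up
print(b.prepare("retry", now=60))    # False: the stale session still holds the sstables
print(b.prepare("retry", now=300))   # True: the stale session has expired
```

Either fix releases the sstables claimed by the abandoned session; the expiry variant does not depend on the cancel message reaching an overloaded node.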


was (Author: krummas):
To summarize this;
1. we had a bug in compaction marking that could make a node end up in an 
infinite loop, fixed in branch linked above
2. we allowed multiple repairs over the same sstables, fixed
3. we had a situation where we didn't remove the parent repair sessions, fixed

And, to describe the final problem:

# Node A sends a PrepareMessage to overloaded Node B
# B starts preparing
# A times out waiting for B to prepare
# B finishes preparing and marks a bunch of sstables as being repaired
# User retries the repair on node A
# B gets the new PrepareMessage but sees that the sstables it wants to repair 
are already being repaired, and refuses to start

One solution could be to have A send out a cancel message, another solution 
could be to have B remove any parent repair sessions after 5 (or something) 
minutes if it hasn't received a validation message before that. Need [~yukim] 
input.


[jira] [Comment Edited] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-17 Thread Marcus Eriksson (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249690#comment-14249690
 ] 

Marcus Eriksson edited comment on CASSANDRA-8316 at 12/17/14 10:27 AM:
---

I think we are simply timing out the Prepare message when TRACE is enabled (I 
can't even start an 8-node cluster with TRACE on).

One solution could be to increase the timeout for this message, but we use the 
same timeout for snapshot creation, and that would be just as likely to fail on 
a heavily loaded cluster. wdyt [~yukim]?

Also note that in your test you repair all ranges, meaning that when you repair 
node5, for example, you actually include node3, 4, 5, 6 and 7, so you can't 
repair any of those at the same time.
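
To see why repairing all ranges on node5 pulls in node3 through node7, here is a small sketch. It assumes one token per node, nodes numbered 1..N in ring order, and replicas placed on the owner plus the next RF-1 nodes; the function names are made up for this example and are not Cassandra APIs.

```python
def replicas_of_range(owner, num_nodes, rf):
    """Nodes storing the primary range of `owner`: the owner plus the next rf-1 nodes."""
    return [(owner - 1 + i) % num_nodes + 1 for i in range(rf)]


def nodes_involved_in_full_repair(node, num_nodes, rf):
    """All nodes touched when `node` repairs every range it holds a replica of."""
    # `node` holds its own primary range plus those of the rf-1 nodes before it.
    range_owners = [(node - 1 - i) % num_nodes + 1 for i in range(rf)]
    involved = set()
    for owner in range_owners:
        involved.update(replicas_of_range(owner, num_nodes, rf))
    return sorted(involved)


print(nodes_involved_in_full_repair(5, num_nodes=8, rf=3))  # [3, 4, 5, 6, 7]
```

Under these assumptions each repair touches 2*RF-1 = 5 of the 8 nodes, which is why several repairs started in parallel on such a cluster end up competing for the same nodes.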



was (Author: krummas):
I think we are simply timing out the Prepare message when TRACE is enabled (I 
can't even start an 8-node cluster with TRACE on).

One solution could be to increase the timeout, but we use the same timeout for 
snapshot creation, and that would be just as likely to fail on a heavily loaded 
cluster. wdyt [~yukim]?

Also note that in your test you repair all ranges, meaning that when you repair 
node5, for example, you actually include node3, 4, 5, 6 and 7, so you can't 
repair any of those at the same time.



[jira] [Comment Edited] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-14 Thread Alan Boudreault (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246153#comment-14246153
 ] 

Alan Boudreault edited comment on CASSANDRA-8316 at 12/14/14 9:59 PM:
--

[~krummas], unfortunately, I'm still getting the initial issue AND the issue 
introduced on multiple nodes with the v2 patch + CASSANDRA-8458. This is with 
an 8-node cluster and cassandra-stress with n=100.

{code}
aboudreault@kovarro:~/dev/cstar/8316$ ccm node3 nodetool -- repair -par -inc 
[2014-12-14 16:30:48,037] Starting repair command #1, repairing 3 ranges for 
keyspace r1 (parallelism=PARALLEL, full=false)
[2014-12-14 16:30:48,040] Repair failed with error Already repairing 
SSTableReader(path='/home/aboudreault/.ccm/local/node3/data/r1/Standard1-d11383e083d411e4869ab56034537865/r1-Standard1-ka-38-Data.db'),
 can not continue.
{code}

In addition, I noticed that it took 20 minutes to restart my cluster this time. 
Not sure if it's related to this issue, but I've attached a YourKit snapshot.

Let me know if I can do anything else.


was (Author: aboudreault):
[~krummas], unfortunately, I'm still getting the same issue introduced on 
multiple nodes with the v2 patch + CASSANDRA-8458. 8 nodes cluster and 
cassandra-stress with n=100

{code}
aboudreault@kovarro:~/dev/cstar/8316$ ccm node3 nodetool -- repair -par -inc 
[2014-12-14 16:30:48,037] Starting repair command #1, repairing 3 ranges for 
keyspace r1 (parallelism=PARALLEL, full=false)
[2014-12-14 16:30:48,040] Repair failed with error Already repairing 
SSTableReader(path='/home/aboudreault/.ccm/local/node3/data/r1/Standard1-d11383e083d411e4869ab56034537865/r1-Standard1-ka-38-Data.db'),
 can not continue.
{code}

In addition, I noticed that it took 20 minutes to restart my cluster this time. 
Not sure if it's related to this issue, but I've attached a YourKit snapshot.

Let me know if I can do anything else.


[jira] [Comment Edited] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-12-10 Thread Marcus Eriksson (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241153#comment-14241153
 ] 

Marcus Eriksson edited comment on CASSANDRA-8316 at 12/10/14 2:42 PM:
--

yep, will have a look, seems that we don't clear out the parent repair session 
on this failure mode


was (Author: krummas):
yep, will have a look, seems that we don't clear out the repair session on this 
failure mode


[jira] [Comment Edited] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair

2014-11-25 Thread Alan Boudreault (JIRA)

[
https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224896#comment-14224896
]

Alan Boudreault edited comment on CASSANDRA-8316 at 11/25/14 5:43 PM:
--

[~krummas] [~llambiel] I have been able to reproduce this issue many times.
However, not sure exactly where the problem is. I've attached a small bash
script that I used to test:

* I run a cluster of 8 nodes
* I set the cluster log level to TRACE (very important to reproduce the issue;
we have to slow things down)
* I stress with n=50 and RF=3
* I start 3 nodetool repair -par -inc runs in parallel (important: starting only
2 repairs doesn't produce the issue every time; it depends on the system load, I
guess)
* The issue is related to incremental repair. I can't reproduce it with
sequential repair or with parallel non-incremental repair.
* The compaction strategy is not related. I can reproduce the issue with LCS
and STCS.
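
For illustration, the step of starting several incremental repairs at once could look roughly like the sketch below. It is not the attached test.sh: the helper names and the choice of node1..node3 are assumptions, and the command line follows the `ccm <node> nodetool -- repair -par -inc` form used elsewhere in this thread.

```python
import subprocess


def build_repair_commands(nodes):
    """One `ccm <node> nodetool -- repair -par -inc` command line per node."""
    return [["ccm", node, "nodetool", "--", "repair", "-par", "-inc"] for node in nodes]


def run_in_parallel(commands, spawn=subprocess.Popen):
    """Start every command before waiting on any of them, so the repairs overlap."""
    processes = [spawn(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]


# Example (needs a running ccm cluster, so it is left commented out):
# exit_codes = run_in_parallel(build_repair_commands(["node1", "node2", "node3"]))
```

The `spawn` parameter exists only so the launcher can be exercised without a cluster; by default it uses `subprocess.Popen`.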

Basically, the issue happens when the system is very busy and doesn't respond
fast enough. In the function sendRR of MessagingService.java, a callback is
added with a timeout of 10 seconds. The endpoint doesn't respond within 10
seconds, so we get the error. However, even if we increase that timeout to, for
example, 100 seconds, the system doesn't get better and the load is still very
high. We just get the error message "Lost notification, check server log for
repair state of keyspace ..." instead of "Repair failed with error Did not get
positive replies from all endpoints.". When the load is high (even after the
repair), I checked quickly with YourKit, and what is taking a lot of CPU time is
the AntiEntropyStage thread, so is it the ActiveRepairService that never ends?
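
The timeout behaviour described here can be sketched generically as follows. This is not the sendRR implementation; it only shows a request whose reply arrives after the caller has stopped waiting. The function names and the sub-second delays are made up for the example.

```python
import concurrent.futures
import time


def send_and_wait(handler, timeout_seconds):
    """Run `handler` on another thread and wait at most `timeout_seconds` for its reply."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(handler)
    try:
        return ("reply", future.result(timeout=timeout_seconds))
    except concurrent.futures.TimeoutError:
        # The caller gives up, but the handler keeps running on the other side.
        return ("timeout", None)
    finally:
        pool.shutdown(wait=False)


def busy_endpoint(delay_seconds):
    def handle():
        time.sleep(delay_seconds)  # stands in for a node too loaded to answer quickly
        return "prepared"
    return handle


print(send_and_wait(busy_endpoint(0.01), timeout_seconds=0.5))  # ('reply', 'prepared')
print(send_and_wait(busy_endpoint(0.5), timeout_seconds=0.05))  # ('timeout', None)
```

Raising the timeout only changes which side gives up first; the slow handler still does its work either way, which fits the observation that the load stays high with a 100-second timeout.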

Let me know if you want me to go deeper in the profiling; perhaps I could get a
better profile by enabling a Cassandra agent + YourKit.

was (Author: aboudreault):
[~krummas] [~llambiel] I have been able to reproduce this issue many times.
However, not sure exactly where the problem is. I've attached a small bash
script that I used to test:

* I run a cluster of 8 nodes
* I set the cluster log level to TRACE (very important to reproduce the issue;
we have to slow things down)
* I stress with n=50 and RF=3
* I start 3 nodetool repair -par -inc runs in parallel (important: starting only
2 repairs doesn't produce the issue every time; it depends on the system load, I
guess)
* The issue is related to incremental repair. I can't reproduce it with
sequential repair or with parallel non-incremental repair.
* The compaction strategy is not related. I can reproduce the issue with LCS
and STCS.

Let me know if you want me to go deeper in the profiling; perhaps I could get a
better profile by enabling a Cassandra agent + YourKit.

