Re: Did not get positive replies from all endpoints error on incremental repair

2014-10-31 Thread Juho Mäkinen
I relaunched my cluster from scratch (for an unrelated reason). After the
relaunch I could run nodetool repair -par -inc -pr on the nodes without
issue, but pretty much the moment I started pushing production load
to the cluster I ran into the same problem again. I opened a ticket first
for adding logging info, but I'll most probably end up adding the logging
myself and then start digging into the actual root cause.

I also ran one nodetool repair -par (i.e. without incremental repair) and it
seems that the repair started. I guess I need to go through the sources to
see if there's a different code path which would explain this.

I can't yet call this conclusive, but it seems that I can't run incremental
repairs on the current 2.1.1 and I'm still wondering if anybody else is
experiencing the same problem.

Re: Did not get positive replies from all endpoints error on incremental repair

2014-10-31 Thread Robert Coli
On Fri, Oct 31, 2014 at 8:55 AM, Juho Mäkinen juho.maki...@gmail.com
wrote:

 I can't yet call this conclusive, but it seems that I can't run
 incremental repairs on the current 2.1.1 and I'm still wondering if anybody
 else is experiencing the same problem.


You have repro steps; if I were you I would file a JIRA on
http://issues.apache.org.

=Rob


Re: Did not get positive replies from all endpoints error on incremental repair

2014-10-30 Thread Rahul Neelakantan
It appears to come from the ActiveRepairService.prepareForRepair portion of the
code.

Are you sure all nodes are reachable from the node you are initiating repair
on, at the same time?

Any node up/down/died messages?

Rahul Neelakantan
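
One quick way to answer the reachability question is to attempt a TCP connection from the initiating node to every other node. The sketch below is only an illustration: the function name is hypothetical, and it assumes the cluster uses Cassandra's default inter-node storage_port of 7000 (pass a different port if cassandra.yaml overrides it). A successful connect only shows that the port is open at that moment, not that the node is healthy or seen as up by gossip.

```python
import socket


def unreachable_nodes(nodes, port=7000, timeout=2.0):
    """Return the hosts in `nodes` that refuse or time out on a TCP connect.

    Hypothetical helper. `port` defaults to 7000, which is assumed to be
    the inter-node storage_port of the cluster being checked.
    """
    down = []
    for host in nodes:
        try:
            # Open and immediately close a connection; we only care
            # whether the connect itself succeeds.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Covers refused connections, timeouts and DNS failures.
            down.append(host)
    return down
```

Run it on the node where the repair is started, passing the addresses of all other nodes; an empty list means every node accepted the connection.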

 On Oct 30, 2014, at 6:37 AM, Juho Mäkinen juho.maki...@gmail.com wrote:
 
 I'm having problems running nodetool repair -inc -par -pr on my 2.1.1 cluster
 due to a "Did not get positive replies from all endpoints" error.
 
 Here's an example output:
 root@db08-3:~# nodetool repair -par -inc -pr
 [2014-10-30 10:33:02,396] Nothing to repair for keyspace 'system'
 [2014-10-30 10:33:02,420] Starting repair command #10, repairing 256 ranges for keyspace profiles (seq=false, full=false)
 [2014-10-30 10:33:17,240] Repair failed with error Did not get positive replies from all endpoints.
 [2014-10-30 10:33:17,263] Starting repair command #11, repairing 256 ranges for keyspace OpsCenter (seq=false, full=false)
 [2014-10-30 10:33:32,242] Repair failed with error Did not get positive replies from all endpoints.
 [2014-10-30 10:33:32,249] Starting repair command #12, repairing 256 ranges for keyspace system_traces (seq=false, full=false)
 [2014-10-30 10:33:44,243] Repair failed with error Did not get positive replies from all endpoints.
 
 The local system log shows that the repair commands got started, but it seems
 that they get cancelled immediately due to that error, which btw can't be
 seen in the Cassandra log.
 
 I tried monitoring the logs on all machines in case another machine would
 show some useful error, but so far I haven't found anything.
 
 Any ideas where this error comes from?
 
  - Garo
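
When repairs are run across many nodes, it can help to summarize output like the above automatically. The following is a minimal sketch that pairs each "Starting repair command" line with a following "Repair failed with error" line. The function and pattern names are hypothetical, and it assumes each log entry sits on a single line (not wrapped the way mail clients wrap it) in the format shown in this thread.

```python
import re

# Patterns match the nodetool output format quoted in this thread.
START = re.compile(
    r"Starting repair command #(\d+), repairing \d+ ranges for keyspace (\S+)"
)
FAILED = re.compile(r"Repair failed with error (.+)")


def failed_repairs(output):
    """Return a list of (command_id, keyspace, error) for failed repairs."""
    failures = []
    current = None  # (command_id, keyspace) of the most recent start line
    for line in output.splitlines():
        started = START.search(line)
        if started:
            current = (int(started.group(1)), started.group(2))
            continue
        failed = FAILED.search(line)
        if failed and current:
            failures.append((current[0], current[1], failed.group(1).strip()))
            current = None
    return failures
```

Applied to the output above, this reports repair commands #10, #11 and #12 (keyspaces profiles, OpsCenter and system_traces) as failed with the same error.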
 


Re: Did not get positive replies from all endpoints error on incremental repair

2014-10-30 Thread Juho Mäkinen
No, the cluster seems to be performing just fine. It seems that the
prepareForRepair callback() could be easily modified to print which node(s)
are unable to respond, so that the debugging effort could be focused
better. This of course doesn't help in this case, as it's not trivial to add
the log lines and to roll them out to the entire cluster.
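
To illustrate the kind of logging meant here, the sketch below tracks which endpoints have answered a prepare request and names the ones that did not. It is not Cassandra's code (Cassandra is written in Java, and its actual prepareForRepair implementation is not shown in this thread); the class and function names are hypothetical and only demonstrate the idea of reporting the missing nodes instead of a generic error.

```python
import threading


class PrepareTracker:
    """Hypothetical helper: records which endpoints acknowledged a prepare."""

    def __init__(self, endpoints):
        self._pending = set(endpoints)
        self._failed = set()
        self._lock = threading.Lock()
        self._done = threading.Event()
        if not self._pending:
            self._done.set()

    def on_response(self, endpoint, success=True):
        """Called once per reply; may be invoked from any thread."""
        with self._lock:
            self._pending.discard(endpoint)
            if not success:
                self._failed.add(endpoint)
            if not self._pending:
                self._done.set()

    def wait(self, timeout):
        """Wait up to `timeout` seconds; return (ok, missing, failed)."""
        self._done.wait(timeout)
        with self._lock:
            missing = sorted(self._pending)
            failed = sorted(self._failed)
        return (not missing and not failed), missing, failed


def describe_failure(missing, failed):
    """Build an error message that names the problematic endpoints."""
    parts = []
    if missing:
        parts.append("no reply from: " + ", ".join(missing))
    if failed:
        parts.append("negative reply from: " + ", ".join(failed))
    return ("Did not get positive replies from all endpoints ("
            + "; ".join(parts) + ")")
```

With a message like this, the operator would know immediately which node's log to inspect.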

The cluster is relatively young, containing only 450GB with RF=3 spread
over nine nodes, and I was still practicing running incremental repairs on
the cluster when I stumbled on this issue.
