Re: nodetool repair saying starting and then nothing, and nothing in any of the server logs either

2014-07-01 Thread Kevin Burton
If the boxes are idle, you could run jstack against the Cassandra process
and look at the thread stacks... perhaps it's blocked on a lock somewhere.

Worth a shot.


On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox tar...@cabotresearch.com
wrote:

 I have a six node cluster in AWS (repl:3) and recently noticed that repair
 was hanging.  I've run with the -pr switch.

 I see this output in the nodetool command line (and also in that node's
 system.log):
  Starting repair command #9, repairing 256 ranges for keyspace dev_a

 but then no other output.  And I see nothing in any of the other nodes'
 log files.

 Right now the application using C* is turned off so there is zero activity.
 I've let it be in this state for up to 24 hours with nothing more logged.

 Any suggestions?







2014-07-01 Thread Robert Coli
On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox tar...@cabotresearch.com
wrote:

 I have a six node cluster in AWS (repl:3) and recently noticed that repair
 was hanging.  I've run with the -pr switch.


It'll do that.

What version of Cassandra?

=Rob



2014-07-01 Thread Brian Tarbox
We're running 1.2.13.

Any chance that doing a rolling-restart would help?

Would running without the -pr improve the odds?

Thanks.







2014-07-01 Thread Brian Tarbox
Does this output from jstack indicate a problem?

ReadRepairStage:12170 daemon prio=10 tid=0x7f9dcc018800 nid=0x7361 waiting on condition [0x7f9db540c000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  0x000613e049d8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

ReadRepairStage:12169 daemon prio=10 tid=0x7f9dd4009000 nid=0x7340 waiting on condition [0x7f9db53cb000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  0x000613e049d8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

ReadRepairStage:12168 daemon prio=10 tid=0x7f9dd001d000 nid=0x733f waiting on condition [0x7f9db51a6000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  0x000613e049d8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
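
A note on reading a dump like this one: a pool thread that is TIMED_WAITING inside ThreadPoolExecutor.getTask is an idle worker polling its queue for work, which by itself does not point to a lock-up. Threads in the BLOCKED state are usually the more interesting ones. The Python sketch below is one rough way to summarize a saved jstack dump along those lines. It assumes thread entries are separated by blank lines and that each header line contains " prio=". The function name summarize_dump and the second sample thread (worker-1) are made up for illustration and do not come from the dump above.

```python
import re
from collections import Counter

# Matches the state line jstack prints under each thread header.
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def summarize_dump(dump):
    """Return (state counts, names of BLOCKED threads, idle pool worker count)."""
    states = Counter()
    blocked = []
    idle_workers = 0
    # Assumption: thread entries are separated by blank lines.
    for entry in re.split(r"\n\s*\n", dump.strip()):
        match = STATE_RE.search(entry)
        if not match:
            continue
        state = match.group(1)
        states[state] += 1
        # Header text before " prio=" is the thread name (plus "daemon" if set).
        name = entry.splitlines()[0].split(" prio=")[0].strip('"')
        if state == "BLOCKED":
            blocked.append(name)
        elif "ThreadPoolExecutor.getTask" in entry:
            # Parked in getTask(): an idle worker waiting for its next task.
            idle_workers += 1
    return states, blocked, idle_workers


# First entry is abbreviated from the dump above; the second is hypothetical.
SAMPLE = """\
ReadRepairStage:12170 daemon prio=10 tid=0x7f9dcc018800 nid=0x7361 waiting on condition [0x7f9db540c000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)

worker-1 prio=10 tid=0x1 nid=0x2 waiting for monitor entry [0x3]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at example.Hypothetical.method(Hypothetical.java:1)
"""

if __name__ == "__main__":
    states, blocked, idle_workers = summarize_dump(SAMPLE)
    print(dict(states), blocked, idle_workers)
```

Run against a full dump, an empty BLOCKED list plus many idle workers would suggest the JVM is waiting for work rather than deadlocked, though it would not explain why the repair session itself produced no further output.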











2014-07-01 Thread Robert Coli
On Tue, Jul 1, 2014 at 11:09 AM, Brian Tarbox tar...@cabotresearch.com
wrote:

 We're running 1.2.13.


1.2.17 contains a few streaming fixes which might help.


 Any chance that doing a rolling-restart would help?


Probably not.


 Would running without the -pr improve the odds?


No, that'd make it less likely to succeed.

=Rob



2014-07-01 Thread Brian Tarbox
Given that an upgrade is (for various internal reasons) not an option at
this point...is there anything I can do to get repair working again?  I'll
also mention that I see this behavior from all nodes.

Thanks.







2014-07-01 Thread Robert Coli
On Tue, Jul 1, 2014 at 11:54 AM, Brian Tarbox tar...@cabotresearch.com
wrote:

 Given that an upgrade is (for various internal reasons) not an option at
 this point...is there anything I can do to get repair working again?  I'll
 also mention that I see this behavior from all nodes.


I think maybe increasing your phi tolerance for streaming timeouts might
help.

But basically, no. Repair has historically been quite broken in AWS. It was
re-written in 2.0 along with the rest of streaming, and hopefully will soon
stabilize and actually work.

For what purpose are you running repair?

=Rob
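
A rough illustration of what raising the phi tolerance buys you, assuming the setting meant above is phi_convict_threshold in cassandra.yaml (default 8, as far as I know) and assuming a simple exponential model of heartbeat arrivals. Under that model phi grows linearly with the time since the last heartbeat, so the threshold translates directly into how long a peer may stay silent before it is convicted. The function name and the 1-second mean interval are illustrative choices, not values taken from this cluster, and this is a sketch of the general phi accrual idea, not a copy of Cassandra's failure detector.

```python
import math


def seconds_until_convicted(phi_threshold, mean_interval_s):
    """Silence (in seconds) after which phi reaches the threshold.

    Exponential model: phi(t) = t / (mean_interval * ln 10),
    so phi(t) == threshold at t = threshold * mean_interval * ln 10.
    """
    return phi_threshold * mean_interval_s * math.log(10)


if __name__ == "__main__":
    for threshold in (8, 12):
        t = seconds_until_convicted(threshold, mean_interval_s=1.0)
        print(f"threshold {threshold}: about {t:.1f} s of silence tolerated")
```

Going from 8 to 12 in this model lets a peer stay silent about half again as long before being marked down. Whether that is enough to keep a 1.2.x repair stream alive on AWS is not something this sketch can tell you.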



2014-07-01 Thread Brian Tarbox
 For what purpose are you running repair?

Because I read that we should! :-)

We do delete data from one column family quite regularly, and from the other
CFs occasionally.  We almost never run with less than 100% of our nodes up.

In this configuration do we *need* to run repair?

Thanks,

