Re: nodetool repair saying starting and then nothing, and nothing in any of the server logs either
If the boxes are idle, you could use jstack and look at the stack… perhaps it's locked somewhere. Worth a shot.

On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox tar...@cabotresearch.com wrote:
> I have a six node cluster in AWS (repl:3) and recently noticed that repair was hanging. I've run with the -pr switch.
>
> I see this output in the nodetool command line (and also in that node's system.log):
>
>     Starting repair command #9, repairing 256 ranges for keyspace dev_a
>
> but then no other output. And I see nothing in any of the other node's log files.
>
> Right now the application using C* is turned off so there is zero activity. I've let it be in this state for up to 24 hours with nothing more logged.
>
> Any suggestions?
Re: nodetool repair saying starting and then nothing, and nothing in any of the server logs either
On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox tar...@cabotresearch.com wrote:
> I have a six node cluster in AWS (repl:3) and recently noticed that repair was hanging. I've run with the -pr switch.

It'll do that. What version of Cassandra?

=Rob
Re: nodetool repair saying starting and then nothing, and nothing in any of the server logs either
We're running 1.2.13.

Any chance that doing a rolling-restart would help?

Would running without the -pr improve the odds?

Thanks.

On Tue, Jul 1, 2014 at 1:40 PM, Robert Coli rc...@eventbrite.com wrote:
> On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox tar...@cabotresearch.com wrote:
>> I have a six node cluster in AWS (repl:3) and recently noticed that repair was hanging. I've run with the -pr switch.
>
> It'll do that. What version of Cassandra?
>
> =Rob
Re: nodetool repair saying starting and then nothing, and nothing in any of the server logs either
Does this output from jstack indicate a problem?

    ReadRepairStage:12170 daemon prio=10 tid=0x7f9dcc018800 nid=0x7361 waiting on condition [0x7f9db540c000]
       java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for 0x000613e049d8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

    ReadRepairStage:12169 daemon prio=10 tid=0x7f9dd4009000 nid=0x7340 waiting on condition [0x7f9db53cb000]
       java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for 0x000613e049d8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

    ReadRepairStage:12168 daemon prio=10 tid=0x7f9dd001d000 nid=0x733f waiting on condition [0x7f9db51a6000]
       java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for 0x000613e049d8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

On Tue, Jul 1, 2014 at 2:09 PM, Brian Tarbox tar...@cabotresearch.com wrote:
> We're running 1.2.13.
>
> Any chance that doing a rolling-restart would help?
>
> Would running without the -pr improve the odds?
>
> Thanks.
>
> On Tue, Jul 1, 2014 at 1:40 PM, Robert Coli rc...@eventbrite.com wrote:
>> On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox tar...@cabotresearch.com wrote:
>>> I have a six node cluster in AWS (repl:3) and recently noticed that repair was hanging. I've run with the -pr switch.
>>
>> It'll do that. What version of Cassandra?
>>
>> =Rob
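For what it's worth, a thread parked in LinkedBlockingQueue.poll underneath ThreadPoolExecutor.getTask is an idle pool worker waiting for its next task, so the three ReadRepairStage threads above do not by themselves point to a hang. With a full dump it is easier to tally threads than to read them one by one. Below is a minimal Python sketch of that idea; the function names (parse_threads, summarize) and the "idle"/"busy" labels are made up for illustration, and the header pattern is an assumption based on the dump format shown in this thread.

```python
import re
from collections import Counter

# Thread header line. Quotes around the name and a "#N" field are treated as
# optional because jstack output differs between JVM versions (assumption).
HEADER = re.compile(r'^"?(?P<name>[^"]+?)"?\s+(?:#\d+\s+)?(?:daemon\s+)?prio=\d+')
STATE = re.compile(r"java\.lang\.Thread\.State:\s+(\w+)")
# A pool worker with this frame on its stack is waiting for work, not doing any.
IDLE_FRAME = "java.util.concurrent.ThreadPoolExecutor.getTask"


def parse_threads(dump):
    """Split a jstack dump into records with name, state and stack frames."""
    threads = []
    for raw in dump.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            threads.append({"name": header.group("name"), "state": None, "frames": []})
        elif threads:
            state = STATE.search(line)
            if state:
                threads[-1]["state"] = state.group(1)
            elif line.startswith("at "):
                threads[-1]["frames"].append(line[3:])
    return threads


def summarize(dump):
    """Count threads by (pool name, thread state, idle or busy)."""
    counts = Counter()
    for thread in parse_threads(dump):
        pool = thread["name"].split(":")[0]
        idle = any(frame.startswith(IDLE_FRAME) for frame in thread["frames"])
        counts[(pool, thread["state"], "idle" if idle else "busy")] += 1
    return counts


# Two of the threads from the message above, abbreviated to the relevant frames.
SAMPLE = """
ReadRepairStage:12170 daemon prio=10 tid=0x7f9dcc018800 nid=0x7361 waiting on condition [0x7f9db540c000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
    at java.lang.Thread.run(Thread.java:744)

ReadRepairStage:12169 daemon prio=10 tid=0x7f9dd4009000 nid=0x7340 waiting on condition [0x7f9db53cb000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
    at java.lang.Thread.run(Thread.java:744)
"""

if __name__ == "__main__":
    for key, count in summarize(SAMPLE).items():
        print(key, count)
```

Anything the tally reports as "busy" or BLOCKED is the more interesting place to look; which thread pools actually carry repair work on 1.2.13 is not something this sketch knows about.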
Re: nodetool repair saying starting and then nothing, and nothing in any of the server logs either
On Tue, Jul 1, 2014 at 11:09 AM, Brian Tarbox tar...@cabotresearch.com wrote:
> We're running 1.2.13.

1.2.17 contains a few streaming fixes which might help.

> Any chance that doing a rolling-restart would help?

Probably not.

> Would running without the -pr improve the odds?

No, that'd make it less likely to succeed.

=Rob
Re: nodetool repair saying starting and then nothing, and nothing in any of the server logs either
Given that an upgrade is (for various internal reasons) not an option at this point... is there anything I can do to get repair working again? I'll also mention that I see this behavior from all nodes.

Thanks.

On Tue, Jul 1, 2014 at 2:51 PM, Robert Coli rc...@eventbrite.com wrote:
> On Tue, Jul 1, 2014 at 11:09 AM, Brian Tarbox tar...@cabotresearch.com wrote:
>> We're running 1.2.13.
>
> 1.2.17 contains a few streaming fixes which might help.
>
>> Any chance that doing a rolling-restart would help?
>
> Probably not.
>
>> Would running without the -pr improve the odds?
>
> No, that'd make it less likely to succeed.
>
> =Rob
Re: nodetool repair saying starting and then nothing, and nothing in any of the server logs either
On Tue, Jul 1, 2014 at 11:54 AM, Brian Tarbox tar...@cabotresearch.com wrote:
> Given that an upgrade is (for various internal reasons) not an option at this point... is there anything I can do to get repair working again? I'll also mention that I see this behavior from all nodes.

I think maybe increasing your phi tolerance for streaming timeouts might help.

But basically, no. Repair has historically been quite broken in AWS. It was re-written in 2.0 along with the rest of streaming, and hopefully will soon stabilize and actually work.

For what purpose are you running repair?

=Rob
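To give a feel for what raising the "phi tolerance" would buy, here is a rough Python sketch. It assumes the setting meant is phi_convict_threshold in cassandra.yaml (stock default 8), and it uses a simplified phi accrual model in which heartbeat gaps are exponentially distributed, so phi grows linearly with the time a peer has been silent. The function names and the 1000 ms mean heartbeat interval are illustrative assumptions, not values taken from this cluster.

```python
import math

# Converts the natural-log survival probability to base 10, so phi = 1 means
# "a gap this long has roughly a 10% chance of being normal", phi = 2 means 1%.
PHI_FACTOR = 1.0 / math.log(10.0)


def phi(silence_ms, mean_interval_ms):
    """Suspicion level after silence_ms without a heartbeat (simplified model)."""
    return PHI_FACTOR * silence_ms / mean_interval_ms


def max_silence_ms(threshold, mean_interval_ms):
    """Longest silence tolerated before phi exceeds the convict threshold."""
    return threshold * mean_interval_ms / PHI_FACTOR


if __name__ == "__main__":
    mean_ms = 1000.0  # assumed mean heartbeat interval, for illustration only
    for threshold in (8, 12):
        seconds = max_silence_ms(threshold, mean_ms) / 1000.0
        print("threshold %d tolerates about %.1f s of silence" % (threshold, seconds))
```

Under those assumptions, each extra point of threshold adds roughly 2.3 mean heartbeat intervals of tolerated silence, which is why a higher value can help on a network with occasional long pauses. Whether it changes the outcome of a stuck repair on 1.2.13 is a separate question.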
Re: nodetool repair saying starting and then nothing, and nothing in any of the server logs either
> For what purpose are you running repair?

Because I read that we should! :-)

We do delete data from one column family quite regularly... from the other CFs occasionally. We almost never run with less than 100% of our nodes up.

In this configuration do we *need* to run repair?

Thanks,

On Tue, Jul 1, 2014 at 2:57 PM, Robert Coli rc...@eventbrite.com wrote:
> On Tue, Jul 1, 2014 at 11:54 AM, Brian Tarbox tar...@cabotresearch.com wrote:
>> Given that an upgrade is (for various internal reasons) not an option at this point... is there anything I can do to get repair working again? I'll also mention that I see this behavior from all nodes.
>
> I think maybe increasing your phi tolerance for streaming timeouts might help.
>
> But basically, no. Repair has historically been quite broken in AWS. It was re-written in 2.0 along with the rest of streaming, and hopefully will soon stabilize and actually work.
>
> For what purpose are you running repair?
>
> =Rob