[
https://issues.apache.org/jira/browse/CASSANDRA-6097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yuki Morishita reassigned CASSANDRA-6097:
-----------------------------------------
Assignee: Yuki Morishita
> nodetool repair randomly hangs.
> -------------------------------
>
> Key: CASSANDRA-6097
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6097
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: DataStax AMI
> Reporter: J.B. Langston
> Assignee: Yuki Morishita
> Priority: Minor
> Attachments: dse.stack, nodetool.stack
>
>
> nodetool repair randomly hangs. This is not the same issue as the one where
> repair hangs if a stream is disrupted: it can be reproduced on a single-node
> cluster where no streaming takes place, so I think this may be a JMX
> connection or timeout issue. Thread dumps show that nodetool is waiting on a
> JMX response and that no repair-related threads are running in Cassandra.
>
> Nodetool main thread waiting for JMX response:
> {code}
> "main" prio=5 tid=7ffa4b001800 nid=0x10aedf000 in Object.wait() [10aede000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> - waiting on <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
> at java.lang.Object.wait(Object.java:485)
> at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:34)
> - locked <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
> at org.apache.cassandra.tools.RepairRunner.repairAndWait(NodeProbe.java:976)
> at org.apache.cassandra.tools.NodeProbe.forceRepairAsync(NodeProbe.java:221)
> at org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:1444)
> at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1213)
> {code}
> When nodetool hangs, it does not print out the following message:
> "Starting repair command #XX, repairing 1 ranges for keyspace XXX"
> However, Cassandra logs that repair in system.log:
> 1380033480.95 INFO [Thread-154] 10:38:00,882 Starting repair command #X, repairing X ranges for keyspace XXX
> This suggests that the repair command was received by Cassandra but the
> connection then failed and nodetool didn't receive a response.
> Obviously, running repair on a single-node cluster is pointless but it's the
> easiest way to demonstrate this problem. The customer who reported this has
> also seen the issue on his real multi-node cluster.
> Steps to reproduce:
> Note: I reproduced this once on the official DataStax AMI with DSE 3.1.3
> (Cassandra 1.2.6+patches). I was unable to reproduce on my Mac using the
> same version, and subsequent attempts to reproduce it on the AMI were
> unsuccessful. The customer says he is able to reliably reproduce it on
> his Mac using DSE 3.1.3 and occasionally reproduce it on his real cluster.
> 1) Deploy an AMI using the DataStax AMI at
> https://aws.amazon.com/amis/datastax-auto-clustering-ami-2-2
> 2) Create a test keyspace
> {code}
> create keyspace test WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': 1};
> {code}
> 3) Run an endless loop that runs nodetool repair repeatedly:
> {code}
> while true; do nodetool repair -pr test; done
> {code}
> 4) Wait until repair hangs. It may take many tries; the behavior is random.
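
Because the hang is intermittent, the loop in step 3 has to be watched by hand. The following is a minimal sketch, not part of the original report, of a variant that stops by itself once one invocation runs longer than a chosen limit and leaves the stalled process alive so a thread dump can still be taken. The function name `run_until_hang`, the time limit, and the attempt count are all hypothetical choices made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: rerun a command until one invocation is still running after LIMIT seconds.
# The stalled process is deliberately NOT killed, so it can be inspected afterwards.
#
# usage: run_until_hang LIMIT_SECONDS MAX_ATTEMPTS COMMAND [ARGS...]
run_until_hang() {
  local limit="$1" max_attempts="$2"
  shift 2
  local attempt pid waited
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    # Start the command in the background and discard its output.
    "$@" >/dev/null 2>&1 &
    pid=$!
    waited=0
    # Poll once per second for as long as the process is still alive.
    while kill -0 "$pid" 2>/dev/null; do
      if [ "$waited" -ge "$limit" ]; then
        echo "attempt $attempt: pid $pid still running after ${limit}s"
        return 0
      fi
      sleep 1
      waited=$((waited + 1))
    done
    # Reap the finished process; its exit status is ignored in this sketch.
    wait "$pid" 2>/dev/null || true
  done
  echo "no hang observed in $max_attempts attempts"
  return 1
}

# Example invocation (600 seconds and 1000 attempts are arbitrary values):
#   run_until_hang 600 1000 nodetool repair -pr test
```

When the function reports a stalled attempt, a thread dump of that process can be taken with `jstack`. The reported PID is that of whatever the shell launched; if `nodetool` is a wrapper script that starts the JVM as a child process rather than replacing itself, the JVM's PID will differ and has to be looked up separately. Polling once per second also means each attempt takes at least about a second, which is acceptable for a repair loop but is a limitation of the sketch.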