[ https://issues.apache.org/jira/browse/CASSANDRA-6097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13783281#comment-13783281 ]
J.B. Langston edited comment on CASSANDRA-6097 at 10/1/13 8:06 PM:
-------------------------------------------------------------------
The JMX documentation [states|http://www.oracle.com/technetwork/java/javase/tech/best-practices-jsp-136021.html#mozTocId387765] that notifications are not guaranteed to always be delivered. The API only guarantees that a client either receives all notifications for which it is listening, or can discover that notifications may have been lost. A client can discover when notifications are lost by registering a listener using JMXConnector.addConnectionNotificationListener. It looks like nodetool isn't doing this last part.

It seems like we should register a connection notification listener and, if a notification is lost, signal the condition so that nodetool doesn't hang. At that point nodetool could query the status of the repair via a separate JMX call, or simply print a warning such as "The status of the repair command can't be determined; please check the log."

I would disagree with prioritizing this as trivial. It's not critical, but I have had many customers express frustration with nodetool repair's proclivity for hanging. It makes automating repairs painful because they can't count on nodetool to ever return.
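The listener idea above can be sketched as follows. This is an editor's illustrative sketch, not Cassandra's actual code: the class name, field names, and the latch-based signaling are assumptions; only `JMXConnector.addConnectionNotificationListener` and the `JMXConnectionNotification` type constants come from the JMX API itself.

```java
// Illustrative sketch only (not Cassandra's actual code): a connection
// listener that records when the JMX transport reports lost notifications,
// a failed connection, or a closed connection, so a caller blocked waiting
// for a repair-complete notification can give up instead of hanging forever.
import java.util.concurrent.CountDownLatch;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.remote.JMXConnectionNotification;

public class RepairConnectionWatcher implements NotificationListener {
    private final CountDownLatch broken = new CountDownLatch(1);

    @Override
    public void handleNotification(Notification n, Object handback) {
        String type = n.getType();
        // These three types are defined on javax.management.remote.JMXConnectionNotification.
        if (JMXConnectionNotification.NOTIFS_LOST.equals(type)
                || JMXConnectionNotification.FAILED.equals(type)
                || JMXConnectionNotification.CLOSED.equals(type)) {
            broken.countDown(); // signal the condition the repair wait loop blocks on
        }
    }

    /** True once the connection can no longer be trusted to deliver notifications. */
    public boolean connectionBroken() {
        return broken.getCount() == 0;
    }
}
```

The watcher would be registered with `connector.addConnectionNotificationListener(watcher, null, null)` before issuing the repair call, and the wait loop would check `connectionBroken()` (or await the latch) alongside the repair-complete condition.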
> nodetool repair randomly hangs.
> -------------------------------
>
>          Key: CASSANDRA-6097
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-6097
>      Project: Cassandra
>   Issue Type: Bug
>   Components: Core
>  Environment: DataStax AMI
>     Reporter: J.B. Langston
>     Priority: Trivial
>  Attachments: dse.stack, nodetool.stack
>
>
> nodetool repair randomly hangs. This is not the same issue where repair hangs
> if a stream is disrupted. This can be reproduced on a single-node cluster
> where no streaming takes place, so I think this may be a JMX connection or
> timeout issue. Thread dumps show that nodetool is waiting on a JMX response
> and there are no repair-related threads running in Cassandra.
> Nodetool main thread waiting for JMX response:
> {code}
> "main" prio=5 tid=7ffa4b001800 nid=0x10aedf000 in Object.wait() [10aede000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
>         at java.lang.Object.wait(Object.java:485)
>         at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:34)
>         - locked <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
>         at org.apache.cassandra.tools.RepairRunner.repairAndWait(NodeProbe.java:976)
>         at org.apache.cassandra.tools.NodeProbe.forceRepairAsync(NodeProbe.java:221)
>         at org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:1444)
>         at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1213)
> {code}
> When nodetool hangs, it does not print out the following message:
> "Starting repair command #XX, repairing 1 ranges for keyspace XXX"
> However, Cassandra logs that repair in system.log:
> 1380033480.95 INFO [Thread-154] 10:38:00,882 Starting repair command #X, repairing X ranges for keyspace XXX
> This suggests that the repair command was received by Cassandra but the connection then failed and nodetool didn't receive a response.
> Obviously, running repair on a single-node cluster is pointless, but it's the easiest way to demonstrate this problem. The customer who reported this has also seen the issue on his real multi-node cluster.
> Steps to reproduce:
> Note: I reproduced this once on the official DataStax AMI with DSE 3.1.3
> (Cassandra 1.2.6+patches). I was unable to reproduce on my Mac using the
> same version, and subsequent attempts to reproduce it on the AMI were
> unsuccessful. The customer says he is able to reliably reproduce on his Mac
> using DSE 3.1.3 and occasionally reproduce it on his real cluster.
> 1) Deploy an AMI using the DataStax AMI at https://aws.amazon.com/amis/datastax-auto-clustering-ami-2-2
> 2) Create a test keyspace:
> {code}
> create keyspace test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
> {code}
> 3) Run an endless loop that runs nodetool repair repeatedly:
> {code}
> while true; do nodetool repair -pr test; done
> {code}
> 4) Wait until repair hangs. It may take many tries; the behavior is random.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
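For the automation pain described in the comment above, one stopgap (an editor's illustrative sketch, not a fix from this ticket) is to bound each nodetool invocation with GNU coreutils `timeout` so a hung client cannot stall a repair script forever. The time limit, retry count, and wrapper name are assumptions, not established Cassandra practice.

```shell
# Editor's illustrative workaround sketch: bound each command with GNU
# coreutils `timeout` and retry a few times. Limit and retry count are
# arbitrary examples.
run_with_timeout() {
    # usage: run_with_timeout <limit> <command...>
    local limit="$1"; shift
    local attempts=3 i
    for i in $(seq 1 "$attempts"); do
        if timeout "$limit" "$@"; then
            return 0
        fi
        echo "attempt $i of '$*' did not complete within $limit; retrying" >&2
    done
    return 1
}

# e.g. in the repair loop from step 3 above:
#   while true; do run_with_timeout 4h nodetool repair -pr test; done
```

This does not diagnose the lost JMX notification; it only caps how long the automation waits for nodetool to return.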