[ https://issues.apache.org/jira/browse/CASSANDRA-6097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13783281#comment-13783281 ]

J.B. Langston edited comment on CASSANDRA-6097 at 10/1/13 8:06 PM:
-------------------------------------------------------------------

The JMX documentation 
[states|http://www.oracle.com/technetwork/java/javase/tech/best-practices-jsp-136021.html#mozTocId387765] 
that notifications are not guaranteed to always be delivered. The API only 
guarantees that a client either receives all notifications for which it is 
listening, or can discover that notifications may have been lost. A client 
can discover lost notifications by registering a listener with 
JMXConnector.addConnectionNotificationListener. It looks like nodetool isn't 
doing this last part. Seems like we should register a connection notification 
listener and, if notifications are lost or the connection fails, signal the 
condition so that nodetool doesn't hang. At that point nodetool could query 
the status of the repair via a separate JMX call, or just print a warning 
like "The status of the repair command can't be determined, please check the 
log."

I would disagree with prioritizing this as trivial. It's not critical, but I 
have had many customers express frustration with nodetool repair's proclivity 
for hanging. It makes automating repairs painful because they can't count on 
nodetool ever returning.
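Until it's fixed, automation can at least guard against the hang with a hard 
timeout around the nodetool process. A workaround sketch, assuming Java 8's 
Process.waitFor(long, TimeUnit); the two-hour deadline is an arbitrary choice:
{code}
import java.util.concurrent.TimeUnit;

// Workaround sketch: run nodetool repair in a subprocess and kill it if it
// doesn't return within a deadline, since we can't count on it returning.
public class RepairWithTimeout
{
    public static void main(String[] args) throws Exception
    {
        Process p = new ProcessBuilder("nodetool", "repair", "-pr", "test")
                        .inheritIO()
                        .start();
        if (!p.waitFor(2, TimeUnit.HOURS)) // arbitrary deadline
        {
            p.destroyForcibly();
            System.err.println("repair timed out; check system.log for its actual status");
        }
    }
}
{code}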



> nodetool repair randomly hangs.
> -------------------------------
>
>                 Key: CASSANDRA-6097
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6097
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DataStax AMI
>            Reporter: J.B. Langston
>            Priority: Trivial
>         Attachments: dse.stack, nodetool.stack
>
>
> nodetool repair randomly hangs. This is not the same issue where repair hangs 
> if a stream is disrupted. This can be reproduced on a single-node cluster 
> where no streaming takes place, so I think this may be a JMX connection or 
> timeout issue. Thread dumps show that nodetool is waiting on a JMX response 
> and there are no repair-related threads running in Cassandra. Nodetool main 
> thread waiting for JMX response:
> {code}
> "main" prio=5 tid=7ffa4b001800 nid=0x10aedf000 in Object.wait() [10aede000]
>    java.lang.Thread.State: WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       - waiting on <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
>       at java.lang.Object.wait(Object.java:485)
>       at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:34)
>       - locked <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
>       at org.apache.cassandra.tools.RepairRunner.repairAndWait(NodeProbe.java:976)
>       at org.apache.cassandra.tools.NodeProbe.forceRepairAsync(NodeProbe.java:221)
>       at org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:1444)
>       at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1213)
> {code}
> When nodetool hangs, it does not print out the following message:
> "Starting repair command #XX, repairing 1 ranges for keyspace XXX"
> However, Cassandra logs that repair in system.log:
> 1380033480.95  INFO [Thread-154] 10:38:00,882 Starting repair command #X, repairing X ranges for keyspace XXX
> This suggests that the repair command was received by Cassandra but the 
> connection then failed and nodetool didn't receive a response.
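> For reference, the waiting pattern implied by the stack trace looks roughly 
> like this (a simplified, self-contained sketch with hypothetical names, not 
> the actual RepairRunner source; nodetool uses SimpleCondition rather than a 
> CountDownLatch). The only thing that ever releases the wait is the JMX 
> notification handler, so one lost notification blocks the main thread forever.
> {code}
> import javax.management.Notification;
> import javax.management.NotificationListener;
> import java.util.concurrent.CountDownLatch;
>
> public class RepairWaitSketch
> {
>     static final CountDownLatch repairDone = new CountDownLatch(1);
>
>     // In nodetool this fires when Cassandra reports the repair finished;
>     // if the notification is dropped in transit, it never runs.
>     static final NotificationListener listener = new NotificationListener()
>     {
>         public void handleNotification(Notification n, Object handback)
>         {
>             repairDone.countDown();
>         }
>     };
>
>     public static void main(String[] args) throws InterruptedException
>     {
>         // The repair command was delivered (Cassandra logged "Starting
>         // repair command #X"), but the completion notification was lost:
>         repairDone.await(); // no timeout, no connection-failure hook: hangs
>     }
> }
> {code}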
> Obviously, running repair on a single-node cluster is pointless but it's the 
> easiest way to demonstrate this problem. The customer who reported this has 
> also seen the issue on his real multi-node cluster.
> Steps to reproduce:
> Note: I reproduced this once on the official DataStax AMI with DSE 3.1.3 
> (Cassandra 1.2.6+patches). I was unable to reproduce it on my Mac using the 
> same version, and subsequent attempts to reproduce it on the AMI were 
> unsuccessful. The customer says he is able to reliably reproduce it on his 
> Mac using DSE 3.1.3 and occasionally reproduce it on his real cluster. 
> 1) Deploy an AMI using the DataStax AMI at 
> https://aws.amazon.com/amis/datastax-auto-clustering-ami-2-2
> 2) Create a test keyspace
> {code}
> create keyspace test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
> {code}
> 3) Run an endless loop that runs nodetool repair repeatedly:
> {code}
> while true; do nodetool repair -pr test; done
> {code}
> 4) Wait until repair hangs. It may take many tries; the behavior is random.


