[
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134473#comment-14134473
]
Razi Khaja commented on CASSANDRA-7904:
---------------------------------------
Duncan,
Sorry for my misunderstanding, but the linked ticket:
https://issues.apache.org/jira/browse/CASSANDRA-6651?focusedCommentId=13892345&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13892345
states:
{quote}
Thunder Stumpges added a comment - 05/Feb/14 12:45
FWIW we have this exact same issue. We are running 2.0.3 on a 3 node cluster.
It has happened multiple times, and happens more times than not when running
nodetool repair. There is nearly always one or more AntiEntropySessions
remaining according to tpstats.
One strange thing about the behavior I see is that the output of nodetool
compactionstats returns 0 active compactions, yet when restarting, we get the
exception about "Unfinished compactions reference missing sstables." It does
seem like these two issues are related.
Another thing I see sometimes in the output from nodetool repair is the
following message:
[2014-02-04 14:07:30,858] Starting repair command #7, repairing 768 ranges for
keyspace thunder_test
[2014-02-04 14:08:30,862] Lost notification. You should check server log for
repair status of keyspace thunder_test
[2014-02-04 14:08:30,870] Starting repair command #8, repairing 768 ranges for
keyspace doan_synset
[2014-02-04 14:09:30,874] Lost notification. You should check server log for
repair status of keyspace doan_synset
When this happens, it starts the next repair session immediately rather than
waiting for the current one to finish. This doesn't however seem to always
correlate to a hung session.
My logs don't look much/any different from the OP, but I'd be glad to provide
any more details that might be helpful. We will be upgrading to 2.0.4 in the
next couple days and I will report back if we see any difference in behavior.
{quote}
That is why I believe I had the same issue. I'll move my discussion to the
newly created ticket CASSANDRA-7909.
> Repair hangs
> ------------
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server,
> java version "1.7.0_45"
> Reporter: Duncan Sands
> Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134,
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres. The cluster is not used on the
> weekend, so that is when repair is run on all nodes (in a staggered fashion).
> Nodetool options: -par -pr. There is usually some overlap between repairs:
> repair on one node may well still be running when repair is started on the
> next node.
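> As a rough sketch of that invocation (only the -par -pr options and the
> per-node, staggered scheduling come from this report; the host placeholder is
> illustrative):
> {code}
> # parallel repair of this node's primary ranges only, run on each node in turn
> nodetool -h <node> repair -par -pr
> {code}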
> Repair hangs for some of the nodes almost every weekend. It hung last
> weekend; here are the details:
> In the whole cluster, only one node had an exception since C* was last
> restarted. This node is 192.168.60.136 and the exception is harmless: a
> client disconnected abruptly.
> tpstats
> 4 nodes have a non-zero value for "active" or "pending" in
> AntiEntropySessions. These nodes all have Active => 1 and Pending => 1. The
> nodes are:
> 192.168.21.13 (data centre R)
> 192.168.60.134 (data centre A)
> 192.168.60.136 (data centre A)
> 172.18.68.138 (data centre Z)
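> A quick way to check for these stuck sessions across the cluster (a sketch;
> the host list is the four nodes above and the pool name is as printed by
> tpstats):
> {code}
> # print the AntiEntropySessions thread-pool row for each suspect node
> for h in 192.168.21.13 192.168.60.134 192.168.60.136 172.18.68.138; do
>   echo "== $h"; nodetool -h "$h" tpstats | grep AntiEntropySessions
> done
> {code}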
> compactionstats:
> No compactions. All nodes have:
> pending tasks: 0
> Active compaction remaining time : n/a
> netstats:
> All nodes except one show nothing. The one exception (192.168.60.131, not one
> of the nodes listed in the tpstats section above) shows the following (note
> the Responses Pending value of 1):
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 4233
> Mismatch (Blocking): 0
> Mismatch (Background): 243
> Pool Name                    Active   Pending      Completed
> Commands                        n/a         0       34785445
> Responses                       n/a         1       38567167
> Repair sessions
> I looked for repair sessions that failed to complete. On 3 of the 4 nodes
> mentioned in tpstats above I found that they had sent merkle tree requests
> and got responses from all but one node. In the log file for the node that
> failed to respond there is no sign that it ever received the request. On 1
> node (172.18.68.138) it looks like responses were received from every node,
> some streaming was done, and then... nothing. Details:
> Node 192.168.21.13 (data centre R):
> Sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142,
> /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 for table
> brokers, never got a response from /172.18.68.139. On /172.18.68.139, just
> before this time it sent a response for the same repair session but a
> different table, and there is no record of it receiving a request for table
> brokers.
> Node 192.168.60.134 (data centre A):
> Sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132,
> /192.168.21.14, /192.168.60.134 for table swxess_outbound, never got a
> response from /172.18.68.138. On /172.18.68.138, just before this time it
> sent a response for the same repair session but a different table, and there
> is no record of it receiving a request for table swxess_outbound.
> Node 192.168.60.136 (data centre A):
> Sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for
> table rollups7200, never got a response from /172.18.68.139. This repair
> session is never mentioned in the /172.18.68.139 log.
> Node 172.18.68.138 (data centre Z):
> The issue here seems to be repair session
> #a55c16e1-35eb-11e4-8e7e-51c077eaf311. It got responses for all its merkle
> tree requests, did some streaming, but seems to have stopped after finishing
> with one table (rollups60). I found it as follows: it is the only repair for
> which there is no "session completed successfully" message in the log.
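> One way to spot such a session in the log (a sketch; it assumes the 2.0-era
> log phrases "new session" and "session completed successfully", the latter
> quoted above, and the default system.log name):
> {code}
> # bash: list repair session ids that were started but never logged a completion
> comm -23 \
>   <(grep -o 'repair #[0-9a-f-]*\] new session' system.log |
>       grep -o '[0-9a-f-]\{36\}' | sort -u) \
>   <(grep -o 'repair #[0-9a-f-]*\] session completed successfully' system.log |
>       grep -o '[0-9a-f-]\{36\}' | sort -u)
> {code}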
> Some log file snippets are attached.