[ 
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132122#comment-14132122
 ] 

Razi Khaja edited comment on CASSANDRA-7904 at 9/12/14 9:30 PM:
----------------------------------------------------------------

We have 3 data centers, each with 4 nodes (running on physical machines, not on 
EC2). We had been running Cassandra 2.0.6 and were never able to run 
*nodetool repair* successfully on any of our nodes (except when little or no 
data had been loaded into our keyspaces). We upgraded to Cassandra 2.0.10 
hoping that this *Lost notification* issue during *nodetool repair* would be 
fixed, but as the log below shows, we are still unable to run 
*nodetool repair* successfully.

{code}
[2014-09-12 11:08:02,131] Nothing to repair for keyspace 'system'
[2014-09-12 11:08:02,179] Starting repair command #10, repairing 1389 ranges for keyspace megalink
[2014-09-12 11:12:02,196] Lost notification. You should check server log for repair status of keyspace megalink
[2014-09-12 11:12:02,258] Starting repair command #11, repairing 1389 ranges for keyspace megalink_dev
[2014-09-12 11:12:02,331] Repair command #11 finished
[2014-09-12 11:12:02,346] Starting repair command #12, repairing 512 ranges for keyspace system_traces
[2014-09-12 11:13:02,349] Lost notification. You should check server log for repair status of keyspace system_traces
{code}
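
For what it is worth, my understanding is that nodetool follows repair progress 
through JMX notifications from the StorageService MBean, and prints *Lost 
notification* when the JMX connection reports that notifications were dropped, 
so the repair may still be running on the server even though nodetool has 
stopped reporting on it. The sketch below only illustrates that listening 
pattern; it is not Cassandra's actual nodetool code, and the host, port and 
class name are placeholders.

{code}
import javax.management.MBeanServerConnection;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.ObjectName;
import javax.management.remote.JMXConnectionNotification;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RepairNotificationWatcher
{
    public static void main(String[] args) throws Exception
    {
        // Connect to a node's JMX port (7199 by default), as nodetool does.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);

        NotificationListener listener = new NotificationListener()
        {
            public void handleNotification(Notification notification, Object handback)
            {
                if (JMXConnectionNotification.NOTIFS_LOST.equals(notification.getType()))
                    // The connector dropped notifications instead of delivering them;
                    // this is the condition behind the "Lost notification" message.
                    System.out.println("Notifications dropped: " + notification.getMessage());
                else
                    System.out.println(notification.getType() + ": " + notification.getMessage());
            }
        };

        // Dropped-notification events are reported on the connection itself.
        connector.addConnectionNotificationListener(listener, null, null);

        // Repair progress notifications are emitted by the StorageService MBean.
        MBeanServerConnection mbs = connector.getMBeanServerConnection();
        ObjectName storageService =
                new ObjectName("org.apache.cassandra.db:type=StorageService");
        mbs.addNotificationListener(storageService, listener, null, null);

        Thread.sleep(Long.MAX_VALUE); // keep listening until killed
    }
}
{code}

In other words, the message is about the reporting channel, so the server log 
is still the place to check whether the repair itself finished or hung.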

If there are any more details needed to help solve this problem, please let me 
know and I will do my best to provide them. I am happy to help in any way I 
can.


> Repair hangs
> ------------
>
>                 Key: CASSANDRA-7904
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server, 
> java version "1.7.0_45"
>            Reporter: Duncan Sands
>         Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134, 
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres.  Not used on the weekend, so 
> repair is run on all nodes (in a staggered fashion) on the weekend.  Nodetool 
> options: -par -pr.  There is usually some overlap in the repairs: repair on 
> one node may well still be running when repair is started on the next node.  
> Repair hangs for some of the nodes almost every weekend.  It hung last 
> weekend, here are the details:
> In the whole cluster, only one node had an exception since C* was last 
> restarted.  This node is 192.168.60.136 and the exception is harmless: a 
> client disconnected abruptly.
> tpstats
>   4 nodes have a non-zero value for "active" or "pending" in 
> AntiEntropySessions.  These nodes all have Active => 1 and Pending => 1.  The 
> nodes are:
>   192.168.21.13 (data centre R)
>   192.168.60.134 (data centre A)
>   192.168.60.136 (data centre A)
>   172.18.68.138 (data centre Z)
> compactionstats:
>   No compactions.  All nodes have:
>     pending tasks: 0
>     Active compaction remaining time :        n/a
> netstats:
>   All except one node have nothing.  One node (192.168.60.131, not one of the 
> nodes listed in the tpstats section above) has (note the Responses Pending 
> value of 1):
>     Mode: NORMAL
>     Not sending any streams.
>     Read Repair Statistics:
>     Attempted: 4233
>     Mismatch (Blocking): 0
>     Mismatch (Background): 243
>     Pool Name                    Active   Pending      Completed
>     Commands                        n/a         0       34785445
>     Responses                       n/a         1       38567167
> Repair sessions
>   I looked for repair sessions that failed to complete.  On 3 of the 4 nodes 
> mentioned in tpstats above I found that they had sent merkle tree requests 
> and got responses from all but one node.  In the log file for the node that 
> failed to respond there is no sign that it ever received the request.  On 1 
> node (172.18.68.138) it looks like responses were received from every node, 
> some streaming was done, and then... nothing.  Details:
>   Node 192.168.21.13 (data centre R):
>     Sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142, 
> /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 for table 
> brokers, never got a response from /172.18.68.139.  On /172.18.68.139, just 
> before this time it sent a response for the same repair session but a 
> different table, and there is no record of it receiving a request for table 
> brokers.
>   Node 192.168.60.134 (data centre A):
>     Sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132, 
> /192.168.21.14, /192.168.60.134 for table swxess_outbound, never got a 
> response from /172.18.68.138.  On /172.18.68.138, just before this time it 
> sent a response for the same repair session but a different table, and there 
> is no record of it receiving a request for table swxess_outbound.
>   Node 192.168.60.136 (data centre A):
>     Sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for 
> table rollups7200, never got a response from /172.18.68.139.  This repair 
> session is never mentioned in the /172.18.68.139 log.
>   Node 172.18.68.138 (data centre Z):
>     The issue here seems to be repair session 
> #a55c16e1-35eb-11e4-8e7e-51c077eaf311.  It got responses for all its merkle 
> tree requests, did some streaming, but seems to have stopped after finishing 
> with one table (rollups60).  I found it as follows: it is the only repair for 
> which there is no "session completed successfully" message in the log.
> Some log file snippets are attached.
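
A minimal sketch along these lines could automate the search described above, 
scanning a node's system.log for repair sessions that never report completion. 
It assumes the log phrasing quoted above (session IDs such as 
#a55c16e1-35eb-11e4-8e7e-51c077eaf311 and a "session completed successfully" 
message), which may differ between Cassandra versions; the class name and log 
path are placeholders.

{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Scans a Cassandra system.log and reports repair sessions that are mentioned
 * but never log "session completed successfully".  The log phrasing is assumed
 * from the observations above and may differ between versions.
 */
public class FindHungRepairSessions
{
    // Repair sessions appear in the log as e.g. "#a55c16e1-35eb-11e4-8e7e-51c077eaf311".
    private static final Pattern SESSION_ID = Pattern.compile(
            "#([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})");

    public static void main(String[] args) throws Exception
    {
        if (args.length != 1)
        {
            System.err.println("usage: FindHungRepairSessions <system.log>");
            return;
        }

        // Session id -> whether a completion message was seen for it.
        Map<String, Boolean> completed = new LinkedHashMap<String, Boolean>();
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = reader.readLine()) != null)
        {
            Matcher m = SESSION_ID.matcher(line);
            while (m.find())
            {
                String session = m.group(1);
                boolean done = line.contains("session completed successfully");
                if (done || !completed.containsKey(session))
                    completed.put(session, done);
            }
        }
        reader.close();

        for (Map.Entry<String, Boolean> e : completed.entrySet())
            if (!e.getValue())
                System.out.println("No completion message for repair session " + e.getKey());
    }
}
{code}

For example: java FindHungRepairSessions /var/log/cassandra/system.log (the log 
location is an assumption; use the node's configured path).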



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
