[
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132122#comment-14132122
]
Razi Khaja edited comment on CASSANDRA-7904 at 9/12/14 9:30 PM:
----------------------------------------------------------------
We have 3 data centers, each with 4 nodes (running on physical machines, not on
EC2). We have been running Cassandra 2.0.6 and have not been able to run
*nodetool repair* successfully on any of our nodes (except when little or no
data was loaded into our keyspaces). We upgraded to Cassandra 2.0.10 hoping that
this *Lost notification* issue during *nodetool repair* would be fixed, but as
you can see from the log below, we are still unable to run *nodetool repair*
successfully.
{code}
[2014-09-12 11:08:02,131] Nothing to repair for keyspace 'system'
[2014-09-12 11:08:02,179] Starting repair command #10, repairing 1389 ranges
for keyspace megalink
[2014-09-12 11:12:02,196] Lost notification. You should check server log for
repair status of keyspace megalink
[2014-09-12 11:12:02,258] Starting repair command #11, repairing 1389 ranges
for keyspace megalink_dev
[2014-09-12 11:12:02,331] Repair command #11 finished
[2014-09-12 11:12:02,346] Starting repair command #12, repairing 512 ranges for
keyspace system_traces
[2014-09-12 11:13:02,349] Lost notification. You should check server log for
repair status of keyspace system_traces
{code}
If any more details are needed to help solve this problem, please let me know
and I will do my best to provide them. I am glad to help in any way I can.
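For reference, a minimal sketch of how one might confirm repair status on the
server side after a *Lost notification* (the notification appears to be a
dropped JMX message, so the server log is the authoritative record). The log
path and the exact message wording below are assumptions for a packaged 2.0.x
install and may need adjusting:
{code}
# Sketch only: the log path and message wording are assumptions for a packaged
# Cassandra 2.0.x install; adjust both to match your environment.
LOG=/var/log/cassandra/system.log

# Repair activity recorded for the keyspace whose notification was lost
grep -i 'repair' "$LOG" | grep 'megalink'

# Sessions the server reports as finished; a session that never reaches this
# message is a candidate for a hung repair
grep 'session completed successfully' "$LOG"
{code}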
> Repair hangs
> ------------
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: C* 2.0.10, Ubuntu 14.04, Java HotSpot(TM) 64-Bit Server,
> java version "1.7.0_45"
> Reporter: Duncan Sands
> Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134,
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres. The cluster is not used on the
> weekend, so repair is run on all nodes (in a staggered fashion) over the
> weekend. Nodetool options: -par -pr. There is usually some overlap in the
> repairs: repair on one node may well still be running when repair is started
> on the next node. Repair hangs for some of the nodes almost every weekend. It
> hung last weekend; here are the details:
> In the whole cluster, only one node had an exception since C* was last
> restarted. This node is 192.168.60.136 and the exception is harmless: a
> client disconnected abruptly.
> tpstats:
> 4 nodes have a non-zero value for "active" or "pending" in
> AntiEntropySessions. These nodes all have Active => 1 and Pending => 1. The
> nodes are:
> 192.168.21.13 (data centre R)
> 192.168.60.134 (data centre A)
> 192.168.60.136 (data centre A)
> 172.18.68.138 (data centre Z)
> compactionstats:
> No compactions. All nodes have:
> pending tasks: 0
> Active compaction remaining time : n/a
> netstats:
> All nodes except one show nothing. One node (192.168.60.131, not one of the
> nodes listed in the tpstats section above) has the following (note the
> Responses Pending value of 1):
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 4233
> Mismatch (Blocking): 0
> Mismatch (Background): 243
> Pool Name      Active   Pending   Completed
> Commands          n/a         0    34785445
> Responses         n/a         1    38567167
> Repair sessions:
> I looked for repair sessions that failed to complete. On 3 of the 4 nodes
> mentioned in tpstats above I found that they had sent merkle tree requests
> and got responses from all but one node. In the log file for the node that
> failed to respond there is no sign that it ever received the request. On 1
> node (172.18.68.138) it looks like responses were received from every node,
> some streaming was done, and then... nothing. Details:
> Node 192.168.21.13 (data centre R):
> Sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142,
> /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 for table
> brokers, never got a response from /172.18.68.139. On /172.18.68.139, just
> before this time it sent a response for the same repair session but a
> different table, and there is no record of it receiving a request for table
> brokers.
> Node 192.168.60.134 (data centre A):
> Sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132,
> /192.168.21.14, /192.168.60.134 for table swxess_outbound, never got a
> response from /172.18.68.138. On /172.18.68.138, just before this time it
> sent a response for the same repair session but a different table, and there
> is no record of it receiving a request for table swxess_outbound.
> Node 192.168.60.136 (data centre A):
> Sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for
> table rollups7200, never got a response from /172.18.68.139. This repair
> session is never mentioned in the /172.18.68.139 log.
> Node 172.18.68.138 (data centre Z):
> The issue here seems to be repair session
> #a55c16e1-35eb-11e4-8e7e-51c077eaf311. It got responses for all its merkle
> tree requests, did some streaming, but seems to have stopped after finishing
> with one table (rollups60). I found it as follows: it is the only repair for
> which there is no "session completed successfully" message in the log.
> Some log file snippets are attached.
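A rough sketch of how the search described above could be automated: collect
the repair session ids that appear in the log and flag any that never log
"session completed successfully". The "repair #<uuid>" tag format and the log
path are assumptions for C* 2.0.x and should be checked against the attached
snippets:
{code}
# Sketch only: assumes repair session log lines are tagged "[repair #<uuid>]"
# and that a completed session logs "session completed successfully"; verify
# both against the attached log snippets before relying on this.
LOG=/var/log/cassandra/system.log

for id in $(grep -o 'repair #[0-9a-f-]*' "$LOG" | sort -u | awk '{print $2}'); do
  if ! grep -F "$id" "$LOG" | grep -q 'session completed successfully'; then
    echo "possibly hung repair session: $id"
  fi
done
{code}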