[
https://issues.apache.org/jira/browse/CASSANDRA-7904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anuj Wadehra updated CASSANDRA-7904:
------------------------------------
Attachment: (was: Repair_DEBUG_On_OutboundTcpConnection.txt)
> Repair hangs
> ------------
>
> Key: CASSANDRA-7904
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7904
> Project: Cassandra
> Issue Type: Bug
> Environment: C* 2.0.10, ubuntu 14.04, Java HotSpot(TM) 64-Bit Server,
> java version "1.7.0_45"
> Reporter: Duncan Sands
> Attachments: ls-172.18.68.138, ls-192.168.21.13, ls-192.168.60.134,
> ls-192.168.60.136
>
>
> Cluster of 22 nodes spread over 4 data centres. Not used on the weekend, so
> repair is run on all nodes (in a staggered fashion) on the weekend. Nodetool
> options: -par -pr. There is usually some overlap in the repairs: repair on
> one node may well still be running when repair is started on the next node.
> Repair hangs on some of the nodes almost every weekend. It hung last
> weekend; here are the details:
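The staggered weekend run described above can be sketched as a small driver script. This is a hypothetical sketch, not the reporter's actual tooling: the host names and SSH access are assumptions; only the `nodetool repair -par -pr` invocation comes from the report.

```shell
#!/bin/sh
# Hypothetical staggered-repair driver. Host names are made up;
# -par (parallel) and -pr (primary range only) are the nodetool
# repair options named in the report.
hosts="node-r1 node-a1 node-a2 node-z1"   # hypothetical node names
cmds=""
for h in $hosts; do
  # In a real run this would be:  ssh "$h" nodetool repair -par -pr
  # Here we only collect the commands, to keep the sketch side-effect free.
  cmds="${cmds}ssh $h nodetool repair -par -pr
"
done
printf '%s' "$cmds"
```

Overlap between runs (one node's repair still active when the next starts) is exactly what a driver like this produces unless it waits for each `nodetool repair` to return before moving on.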
> In the whole cluster, only one node had an exception since C* was last
> restarted. This node is 192.168.60.136 and the exception is harmless: a
> client disconnected abruptly.
> tpstats:
> 4 nodes have a non-zero value for "active" or "pending" in
> AntiEntropySessions. These nodes all have Active => 1 and Pending => 1. The
> nodes are:
> 192.168.21.13 (data centre R)
> 192.168.60.134 (data centre A)
> 192.168.60.136 (data centre A)
> 172.18.68.138 (data centre Z)
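The check above (non-zero Active or Pending in AntiEntropySessions) can be automated with a one-line filter over `nodetool tpstats` output. A minimal sketch, assuming the C* 2.0 column order (Pool Name, Active, Pending, Completed); the sample line is a fabricated stand-in for real tpstats output.

```shell
#!/bin/sh
# Fabricated sample of one "nodetool tpstats" row; in practice you would
# pipe the live command:  nodetool tpstats | awk '...'
sample='AntiEntropySessions                 1         1             42'
# $2 = Active, $3 = Pending in the assumed column layout.
result=$(printf '%s\n' "$sample" |
  awk '/AntiEntropySessions/ && ($2 > 0 || $3 > 0) {
         print "stuck: active=" $2 " pending=" $3 }')
echo "$result"
```

Run against each node in turn, this reproduces the report's finding: four nodes with `active=1 pending=1` long after the repair should have finished.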
> compactionstats:
> No compactions. All nodes have:
> pending tasks: 0
> Active compaction remaining time : n/a
> netstats:
> All nodes but one show nothing. The exception (192.168.60.131, not one of
> the nodes listed in the tpstats section above) has a Responses Pending
> value of 1:
> Mode: NORMAL
> Not sending any streams.
> Read Repair Statistics:
> Attempted: 4233
> Mismatch (Blocking): 0
> Mismatch (Background): 243
> Pool Name    Active   Pending   Completed
> Commands     n/a            0    34785445
> Responses    n/a            1    38567167
> Repair sessions:
> I looked for repair sessions that failed to complete. On 3 of the 4 nodes
> mentioned in tpstats above I found that they had sent merkle tree requests
> and got responses from all but one node. In the log file for the node that
> failed to respond there is no sign that it ever received the request. On 1
> node (172.18.68.138) it looks like responses were received from every node,
> some streaming was done, and then... nothing. Details:
> Node 192.168.21.13 (data centre R):
> Sent merkle trees to /172.18.33.24, /192.168.60.140, /192.168.60.142,
> /172.18.68.139, /172.18.68.138, /172.18.33.22, /192.168.21.13 for table
> brokers, never got a response from /172.18.68.139. On /172.18.68.139, just
> before this time it sent a response for the same repair session but a
> different table, and there is no record of it receiving a request for table
> brokers.
> Node 192.168.60.134 (data centre A):
> Sent merkle trees to /172.18.68.139, /172.18.68.138, /192.168.60.132,
> /192.168.21.14, /192.168.60.134 for table swxess_outbound, never got a
> response from /172.18.68.138. On /172.18.68.138, just before this time it
> sent a response for the same repair session but a different table, and there
> is no record of it receiving a request for table swxess_outbound.
> Node 192.168.60.136 (data centre A):
> Sent merkle trees to /192.168.60.142, /172.18.68.139, /192.168.60.136 for
> table rollups7200, never got a response from /172.18.68.139. This repair
> session is never mentioned in the /172.18.68.139 log.
> Node 172.18.68.138 (data centre Z):
> The issue here seems to be repair session
> #a55c16e1-35eb-11e4-8e7e-51c077eaf311. It got responses for all its merkle
> tree requests, did some streaming, but seems to have stopped after finishing
> with one table (rollups60). I found it as follows: it is the only repair for
> which there is no "session completed successfully" message in the log.
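The search just described (every repair session id that appears in the log without a matching "session completed successfully" line) can be sketched as a log filter. The exact log phrasing is an assumption based on C* 2.0 output; the first session id is the one from the report, the second is hypothetical.

```shell
#!/bin/sh
# Fabricated three-line log excerpt; in practice you would pipe the
# real system.log through the same awk program.
log='[repair #a55c16e1-35eb-11e4-8e7e-51c077eaf311] new session
[repair #b2c3d4e5-35eb-11e4-8e7e-51c077eaf311] new session
[repair #b2c3d4e5-35eb-11e4-8e7e-51c077eaf311] session completed successfully'
hung=$(printf '%s\n' "$log" | awk '
  # Pull the session id out of the "[repair #...]" tag on each line.
  match($0, /repair #[0-9a-f-]+/) { id = substr($0, RSTART + 8, RLENGTH - 8) }
  /new session/                    { started[id] = 1 }
  /session completed successfully/ { delete started[id] }
  # Whatever is left never logged a completion message.
  END { for (s in started) print "never completed: " s }')
echo "$hung"
```

Applied to the 172.18.68.138 log, this kind of filter singles out session #a55c16e1-35eb-11e4-8e7e-51c077eaf311 as the one that streamed, finished rollups60, and then went silent.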
> Some log file snippets are attached.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)