Mina Naguib created CASSANDRA-4740:
--------------------------------------
Summary: Phantom TCP connections, failing hinted handoff
Key: CASSANDRA-4740
URL: https://issues.apache.org/jira/browse/CASSANDRA-4740
Project: Cassandra
Issue Type: Bug
Affects Versions: 1.1.2
Environment: Linux 3.4.9, java 1.6.0_35-b10
Reporter: Mina Naguib
Priority: Minor
IP addresses in report anonymized:
Had a server running cassandra (1.1.1.10) reboot ungracefully. Reboot and
startup was successful and uneventful. cassandra went back into service ok.
>From that point onwards however, several (but not all) machines in the
>cassandra cluster started having difficulty with hinted handoff to that
>machine. This was despite nodetool ring showing Up across the board.
Here's an example of an attempt, every 10 minutes, by a node (1.1.1.11) to
replay hints to the node that was rebooted:
{code}
INFO [HintedHandoff:1] 2012-10-01 11:07:23,293 HintedHandOffManager.java (line
294) Started hinted handoff for token: 122879743610338889583996386017027409691
with IP: /1.1.1.10
INFO [HintedHandoff:1] 2012-10-01 11:07:33,295 HintedHandOffManager.java (line
372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
INFO [HintedHandoff:1] 2012-10-01 11:07:33,295 HintedHandOffManager.java (line
390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
INFO [HintedHandoff:1] 2012-10-01 11:17:23,312 HintedHandOffManager.java (line
294) Started hinted handoff for token: 122879743610338889583996386017027409691
with IP: /1.1.1.10
INFO [HintedHandoff:1] 2012-10-01 11:17:33,319 HintedHandOffManager.java (line
372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
INFO [HintedHandoff:1] 2012-10-01 11:17:33,319 HintedHandOffManager.java (line
390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
INFO [HintedHandoff:1] 2012-10-01 11:27:23,335 HintedHandOffManager.java (line
294) Started hinted handoff for token: 122879743610338889583996386017027409691
with IP: /1.1.1.10
INFO [HintedHandoff:1] 2012-10-01 11:27:33,337 HintedHandOffManager.java (line
372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
INFO [HintedHandoff:1] 2012-10-01 11:27:33,337 HintedHandOffManager.java (line
390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
INFO [HintedHandoff:1] 2012-10-01 11:37:23,357 HintedHandOffManager.java (line
294) Started hinted handoff for token: 122879743610338889583996386017027409691
with IP: /1.1.1.10
INFO [HintedHandoff:1] 2012-10-01 11:37:33,358 HintedHandOffManager.java (line
372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
INFO [HintedHandoff:1] 2012-10-01 11:37:33,359 HintedHandOffManager.java (line
390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
INFO [HintedHandoff:1] 2012-10-01 11:47:23,412 HintedHandOffManager.java (line
294) Started hinted handoff for token: 122879743610338889583996386017027409691
with IP: /1.1.1.10
INFO [HintedHandoff:1] 2012-10-01 11:47:33,414 HintedHandOffManager.java (line
372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
INFO [HintedHandoff:1] 2012-10-01 11:47:33,414 HintedHandOffManager.java (line
390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
{code}
I started poking around, and discovered that several nodes held ESTABLISHED TCP
connections that didn't have a live endpoint on the rebooted node. My guess is
they were live prior to the reboot, and after the reboot the nodes still see
them as live and unsuccessfully try to use them.
Example, on the node that was rebooted:
{code}
.10 ~ # netstat -tn | grep 1.1.1.11
tcp 0 0 1.1.1.10:7000 1.1.1.11:40960 ESTABLISHED
tcp 0 0 1.1.1.10:34370 1.1.1.11:7000 ESTABLISHED
tcp 0 0 1.1.1.10:45518 1.1.1.11:7000 ESTABLISHED
{code}
While on the node that's failing to hint to it:
{code}
.11 ~ # netstat -tn | grep 1.1.1.10
tcp 0 0 1.1.1.11:7000 1.1.1.10:34370 ESTABLISHED
tcp 0 0 1.1.1.11:7000 1.1.1.10:45518 ESTABLISHED
tcp 0 0 1.1.1.11:7000 1.1.1.10:53316 ESTABLISHED
tcp 0 0 1.1.1.11:7000 1.1.1.10:43239 ESTABLISHED
tcp 0 0 1.1.1.11:40960 1.1.1.10:7000 ESTABLISHED
{code}
Notice the phantom connections on :53316 and :43239 which do not appear on the
remote 1.1.1.10
On .11 I tried disabling and enabling gossip, but that did not restart :7000
nor clean up the 2 phantom connections. For good measure I also tried
disabling and enabling thrift (long shot), and that didn't help either.
The only thing that helped was to actually stop and start cassandra, in a
rolling fashion, on each node that was having trouble hinting to the machine
that was rebooted. The phantom connections naturally disappeared, write volume
on 1.1.1.10 rose for a while, and all the hints were sent successfully.
I'm unsure whether the phantom TCP connections are a cause, or just loosely
correlated, to the hinted handoff failure every 10 minutes.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira