Mina Naguib created CASSANDRA-4740:
--------------------------------------

             Summary: Phantom TCP connections, failing hinted handoff
                 Key: CASSANDRA-4740
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4740
             Project: Cassandra
          Issue Type: Bug
    Affects Versions: 1.1.2
         Environment: Linux 3.4.9, java 1.6.0_35-b10
            Reporter: Mina Naguib
            Priority: Minor


IP addresses in report anonymized:

Had a server running cassandra (1.1.1.10) reboot ungracefully.  Reboot and 
startup was successful and uneventful.  cassandra went back into service ok.

>From that point onwards however, several (but not all) machines in the 
>cassandra cluster started having difficulty with hinted handoff to that 
>machine.  This was despite nodetool ring showing Up across the board.

Here's an example of an attempt, every 10 minutes, by a node (1.1.1.11) to 
replay hints to the node that was rebooted:
{code}
INFO [HintedHandoff:1] 2012-10-01 11:07:23,293 HintedHandOffManager.java (line 
294) Started hinted handoff for token: 122879743610338889583996386017027409691 
with IP: /1.1.1.10
INFO [HintedHandoff:1] 2012-10-01 11:07:33,295 HintedHandOffManager.java (line 
372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
INFO [HintedHandoff:1] 2012-10-01 11:07:33,295 HintedHandOffManager.java (line 
390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10

INFO [HintedHandoff:1] 2012-10-01 11:17:23,312 HintedHandOffManager.java (line 
294) Started hinted handoff for token: 122879743610338889583996386017027409691 
with IP: /1.1.1.10
INFO [HintedHandoff:1] 2012-10-01 11:17:33,319 HintedHandOffManager.java (line 
372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
INFO [HintedHandoff:1] 2012-10-01 11:17:33,319 HintedHandOffManager.java (line 
390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10

INFO [HintedHandoff:1] 2012-10-01 11:27:23,335 HintedHandOffManager.java (line 
294) Started hinted handoff for token: 122879743610338889583996386017027409691 
with IP: /1.1.1.10
INFO [HintedHandoff:1] 2012-10-01 11:27:33,337 HintedHandOffManager.java (line 
372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
INFO [HintedHandoff:1] 2012-10-01 11:27:33,337 HintedHandOffManager.java (line 
390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10

INFO [HintedHandoff:1] 2012-10-01 11:37:23,357 HintedHandOffManager.java (line 
294) Started hinted handoff for token: 122879743610338889583996386017027409691 
with IP: /1.1.1.10
INFO [HintedHandoff:1] 2012-10-01 11:37:33,358 HintedHandOffManager.java (line 
372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
INFO [HintedHandoff:1] 2012-10-01 11:37:33,359 HintedHandOffManager.java (line 
390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10

INFO [HintedHandoff:1] 2012-10-01 11:47:23,412 HintedHandOffManager.java (line 
294) Started hinted handoff for token: 122879743610338889583996386017027409691 
with IP: /1.1.1.10
INFO [HintedHandoff:1] 2012-10-01 11:47:33,414 HintedHandOffManager.java (line 
372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
INFO [HintedHandoff:1] 2012-10-01 11:47:33,414 HintedHandOffManager.java (line 
390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
{code}

I started poking around, and discovered that several nodes held ESTABLISHED TCP 
connections that didn't have a live endpoint on the rebooted node.  My guess is 
they were live prior to the reboot, and after the reboot the nodes still see 
them as live and unsuccessfully try to use them.

Example, on the node that was rebooted:

{code}
.10 ~ # netstat -tn | grep 1.1.1.11
tcp        0      0 1.1.1.10:7000        1.1.1.11:40960        ESTABLISHED
tcp        0      0 1.1.1.10:34370       1.1.1.11:7000         ESTABLISHED
tcp        0      0 1.1.1.10:45518       1.1.1.11:7000         ESTABLISHED
{code}

While on the node that's failing to hint to it:
{code}
.11 ~ # netstat -tn | grep 1.1.1.10
tcp        0      0 1.1.1.11:7000         1.1.1.10:34370       ESTABLISHED
tcp        0      0 1.1.1.11:7000         1.1.1.10:45518       ESTABLISHED
tcp        0      0 1.1.1.11:7000         1.1.1.10:53316       ESTABLISHED
tcp        0      0 1.1.1.11:7000         1.1.1.10:43239       ESTABLISHED
tcp        0      0 1.1.1.11:40960        1.1.1.10:7000        ESTABLISHED
{code}

Notice the phantom connections on :53316 and :43239 which do not appear on the 
remote 1.1.1.10

On .11 I tried disabling and enabling gossip, but that did not restart :7000 
nor clean up the 2 phantom connections.  For good measure I also tried 
disabling and enabling thrift (long shot), and that didn't help either.

The only thing that helped was to actually stop and start cassandra, in a 
rolling fashion, on each node that was having trouble hinting to the machine 
that was rebooted.  The phantom connections naturally disappeared, write volume 
on 1.1.1.10 rose for a while, and all the hints were sent successfully.

I'm unsure whether the phantom TCP connections are a cause, or just loosely 
correlated, to the hinted handoff failure every 10 minutes.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to