[
https://issues.apache.org/jira/browse/CASSANDRA-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13914048#comment-13914048
]
Ananthkumar K S commented on CASSANDRA-6772:
--------------------------------------------
No. The handshaking process between the nodes in the two data centers is
failing. Each side initiates a handshake, but the clients on both sides don't
complete the connection after this private link failure. Retries seem to be
happening continuously, and we can see them in the TCP dump too. So, at the
network level, both ends seem to be fine.
To confirm this, I restarted all the nodes in both data centers, and now
everything works properly. But this recovery should have happened at the
application level without a restart. A private link failure could be a common
occurrence. Moreover, the TCP retries happen at the Cassandra default interval
of 5 seconds.
If you need details at the network level: after the private link failure, I
could see a lot of connections in the FIN_WAIT1 state. There were also a
considerable number of connections in the CLOSE_WAIT state. This will again
result in dropped TCP connections eventually. I guess some sort of
communication is expected from the client to the server to close all of these
at the network layer in this case too. This last point is just my assumption,
based on my knowledge of TCP state transitions.
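For anyone who wants to repeat the network-level check, here is a minimal
sketch of how the FIN_WAIT1 and CLOSE_WAIT connections could be counted from
`netstat -tan`-style output. The helper name `count_tcp_states`, the sample
addresses, and the sample lines are all made up for illustration and are not
taken from the affected cluster. Port 7000 is assumed to be the inter-node
storage port (the usual `storage_port` default); adjust it if your
configuration differs.

```python
from collections import Counter


def count_tcp_states(netstat_output, port=None):
    """Count TCP connection states in `netstat -tan`-style text.

    If `port` is given, only connections whose local or remote
    address ends with that port are counted.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skip headers and non-TCP lines
        local, remote, state = fields[3], fields[4], fields[5]
        if port is not None:
            suffix = ":%d" % port
            if not (local.endswith(suffix) or remote.endswith(suffix)):
                continue
        counts[state] += 1
    return counts


# Hypothetical sample data, NOT captured from the cluster in this report.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
tcp        0      1 10.0.1.5:45012          10.0.2.7:7000           FIN_WAIT1
tcp        0      1 10.0.1.5:45013          10.0.2.8:7000           FIN_WAIT1
tcp        1      0 10.0.1.5:7000           10.0.2.7:51234          CLOSE_WAIT
tcp        0      0 10.0.1.5:9042           10.0.1.9:40000          ESTABLISHED
"""

if __name__ == "__main__":
    # Assumption: 7000 is the inter-node storage port.
    for state, n in sorted(count_tcp_states(SAMPLE, port=7000).items()):
        print(state, n)
```

Running this periodically against real `netstat -tan` output on a node in each
data center should show whether the FIN_WAIT1 and CLOSE_WAIT counts keep
growing after the link comes back.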
> Cassandra inter data center communication broken
> ------------------------------------------------
>
> Key: CASSANDRA-6772
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6772
> Project: Cassandra
> Issue Type: Bug
> Environment: CentOS 6.0
> Reporter: Ananthkumar K S
> Priority: Blocker
>
> I have two data centers, DC1 and DC2. Both communicate via a private link.
> Yesterday, we had a problem with the private link for 10 minutes. Since the
> problem was resolved, nodes in both data centers have not been able to
> communicate with each other. When I run nodetool status on a node in DC1,
> the nodes in DC2 are reported as down. When I run it in DC2, the nodes in
> DC1 are shown as down.
> But in the Cassandra logs, we can clearly see that handshaking is failing
> every 5 seconds for communication between the data centers. At the TCP
> level, Cassandra generates too many connections in FIN_WAIT1, which is still
> a puzzle. The number of CLOSE_WAIT TCP transitions due to this is very high.
> Because of this kind of TCP listen drop problem, we moved from 2.0.1 to
> 2.0.3. In 2.0.1, it occurred within a single data center. But here it's
> between data centers. If it has anything to do with the snitch
> configuration, I am using GossipingPropertyFileSnitch.
> This clearly started happening after the private link failure. Any idea on
> this? The Cassandra version used is 2.0.3.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)