[
https://issues.apache.org/jira/browse/CASSANDRA-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884312#comment-13884312
]
Minh Do commented on CASSANDRA-6619:
------------------------------------
Jonathan, you are right that both 1.2 and 1.1 are designed to read the
version out of each other's headers. However, 1.2, as the sender opening
the outbound socket, expects to receive the peer's version int back
immediately after it sends its own. 1.1, as the receiver, can read the 1.2
header but does not send its version int back.
Here is the piece of code in 1.2 in IncomingTcpConnection.java that sends back
the version int:
    private void handleModernVersion(int version, int header) throws IOException
    {
        DataOutputStream out = new DataOutputStream(socket.getOutputStream());
        out.writeInt(MessagingService.current_version);
        out.flush();
        ......
    }
Because 1.1 does not send this back, the read in OutboundTcpConnection
times out and the socket is disconnected. This cycle repeats until some
code sets the target version correctly. In the lucky case,
IncomingTcpConnection sets the right target version. However, it takes a
while for the other 1.1 nodes to learn that there is a new 1.2 node,
especially if the new 1.2 node can't connect to any 1.1 nodes first.
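To make the failure mode above concrete, here is a minimal, self-contained
sketch. It is not Cassandra code: the class VersionHandshakeSketch, its
methods, the 500 ms timeout, and the version constants are all hypothetical,
and the real handshake carries more than a single int. It only shows that a
sender which blocks waiting for a version reply times out when the receiver
reads the sender's version but never answers.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class VersionHandshakeSketch {
    // Illustrative placeholder values, not taken from the Cassandra source.
    static final int SENDER_VERSION = 6;
    static final int LEGACY_VERSION = 4;

    /** Accepts one connection, reads the peer's version, and replies only if asked to. */
    static Thread startReceiver(ServerSocket server, boolean replyWithVersion) {
        Thread t = new Thread(() -> {
            try (Socket s = server.accept()) {
                DataInputStream in = new DataInputStream(s.getInputStream());
                in.readInt(); // the receiver can always read the sender's version
                if (replyWithVersion) {
                    DataOutputStream out = new DataOutputStream(s.getOutputStream());
                    out.writeInt(LEGACY_VERSION);
                    out.flush();
                }
                in.read(); // keep the socket open until the sender closes it
            } catch (IOException ignored) {
                // the sender disconnected or the server socket was closed
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Sends our version and waits for the peer's; returns -1 if the read times out. */
    static int handshake(int port, int timeoutMillis) throws IOException {
        try (Socket s = new Socket("127.0.0.1", port)) {
            s.setSoTimeout(timeoutMillis);
            DataOutputStream out = new DataOutputStream(s.getOutputStream());
            out.writeInt(SENDER_VERSION);
            out.flush();
            DataInputStream in = new DataInputStream(s.getInputStream());
            try {
                return in.readInt();
            } catch (SocketTimeoutException e) {
                return -1; // no reply: the caller would disconnect and retry
            }
        }
    }

    /** Runs one handshake against a receiver that either replies or stays silent. */
    static int run(boolean receiverReplies) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {
            startReceiver(server, receiverReplies);
            return handshake(server.getLocalPort(), 500);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("replying receiver -> " + run(true));
        System.out.println("silent receiver   -> " + run(false));
    }
}
```

In this sketch the silent receiver stands in for a 1.1 node: each attempt
costs the sender a full read timeout before it can retry, which is why the
delay adds up when many such peers are involved.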
> Race condition issue during upgrading 1.1 to 1.2
> ------------------------------------------------
>
> Key: CASSANDRA-6619
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6619
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Minh Do
> Assignee: Minh Do
> Priority: Minor
> Fix For: 1.2.14
>
> Attachments: patch.txt
>
>
> There is a race condition when upgrading a C* 1.1.x cluster to C* 1.2.
> One issue is that an OutboundTcpConnection can't be established from a 1.2
> node to some 1.1.x nodes. Because of this, a live cluster will suffer from
> high read latency and be unable to fulfill some write requests during the
> upgrade. This is not a problem for a small cluster, but it is a problem for
> a large cluster (100+ nodes) because the upgrade process takes from 10+
> hours to 1+ day(s) to complete.
> We are aware of CASSANDRA-5692; however, it does not fully fix the problem.
> We already have a patch for this and will attach it shortly for feedback.