[ 
https://issues.apache.org/jira/browse/CASSANDRA-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795253#comment-13795253
 ] 

Jeremiah Jordan commented on CASSANDRA-5692:
--------------------------------------------

[~sbtourist] No thread dumps.  Given nodes e1-e5: on e1, queries and "describe 
cluster;" showed all nodes.  Queries to e5 would usually time out, and 
"describe cluster;" on e5 would show e1-e3 as "UNAVAILABLE".  Running 
netstat -tn | grep <e1 ip> on e5 showed:

{noformat}
e5:7000 <-> e1:<high port>
e5:<high port> <-> e1:7000
e5:<high port 2> <-> e1:7000
{noformat}

The same check on e5 against e4, a node which did respond to the "describe 
cluster;", showed:

{noformat}
e5:7000 <-> e4:<high port>
e5:7000 <-> e4:<high port 2>
e5:<high port> <-> e4:7000
e5:<high port 2> <-> e4:7000
{noformat}
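
To make the asymmetry between the two listings easier to spot, here is a small sketch that counts connections per direction. It assumes the simplified "local <-> remote" form used above, not raw netstat output, and the high port numbers are made-up placeholders; the function name is hypothetical.

```python
from collections import Counter

STORAGE_PORT = "7000"  # inter-node port seen in the listings above


def count_directions(lines):
    """Count inbound/outbound storage connections in 'local <-> remote' lines."""
    counts = Counter()
    for line in lines:
        if "<->" not in line:
            continue  # skip blank lines or anything not in the simplified form
        local, remote = (side.strip() for side in line.split("<->"))
        local_port = local.rsplit(":", 1)[1]
        remote_port = remote.rsplit(":", 1)[1]
        if local_port == STORAGE_PORT:
            counts["inbound"] += 1   # peer connected to our storage port
        elif remote_port == STORAGE_PORT:
            counts["outbound"] += 1  # we connected to the peer's storage port
    return counts


# Placeholder high ports; only the 7000 side matters for the count.
e1_lines = ["e5:7000 <-> e1:50001", "e5:50002 <-> e1:7000", "e5:50003 <-> e1:7000"]
e4_lines = [
    "e5:7000 <-> e4:50011",
    "e5:7000 <-> e4:50012",
    "e5:50013 <-> e4:7000",
    "e5:50014 <-> e4:7000",
]

print("e1:", dict(count_directions(e1_lines)))
print("e4:", dict(count_directions(e4_lines)))
```

With the listings above, e1 comes out one inbound connection short of e4.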

In the TRACE-level logs for IncomingTcpConnection.java, when restarting e5 I 
saw only one connection come in from e1, but two come in from e4.  When I 
enabled TRACE-level logging on e1, it just logged something about "version 5" 
over and over, very quickly.  At that point we restarted e1; it came up and 
everything was happy from e1->e5, so we did a rolling restart of the cluster 
and went to bed.


> Race condition in detecting version on a mixed 1.1/1.2 cluster
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-5692
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5692
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.1.9, 1.2.5
>            Reporter: Sergio Bossa
>            Assignee: Sergio Bossa
>            Priority: Minor
>             Fix For: 1.2.7, 2.0 beta 1
>
>         Attachments: 5692-0005.patch, 5692-0006.patch
>
>
> On a mixed 1.1 / 1.2 cluster, starting 1.2 nodes sometimes triggers a race 
> condition in version detection, where the 1.2 node wrongly detects version 6 
> for a 1.1 node.
> It works as follows:
> 1) The just started 1.2 node quickly opens an OutboundTcpConnection toward a 
> 1.1 node before receiving any messages from the latter.
> 2) Given the version is correctly detected only when the first message is 
> received, the version is momentarily set at 6.
> 3) This opens an OutboundTcpConnection from 1.2 to 1.1 at version 6, which 
> gets stuck in the connect() method.
> Later, the version is correctly fixed, but all outbound connections from 1.2 
> to 1.1 are stuck at this point.
> Evidence from 1.2 logs:
> TRACE 13:48:31,133 Assuming current protocol version for /127.0.0.2
> DEBUG 13:48:37,837 Setting version 5 for /127.0.0.2
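
The sequence in the description above can be sketched as a toy model. This is not Cassandra's code: the class and function names are hypothetical, and the only facts taken from the issue are that an unknown peer is assumed to be at version 6 and is later corrected to version 5. The sketch shows only the stale-version part of the problem (a connection that captured the assumed version before the correction), not the blocking in connect().

```python
CURRENT_VERSION = 6  # assumed for a peer until a message from it is received


class VersionTable:
    """Hypothetical per-endpoint version lookup with a 'current' default."""

    def __init__(self):
        self._versions = {}

    def get(self, endpoint):
        # Unknown peers are assumed to speak the current version.
        return self._versions.get(endpoint, CURRENT_VERSION)

    def set(self, endpoint, version):
        self._versions[endpoint] = version


class OutboundConnection:
    """Hypothetical connection that captures the target version once, at open."""

    def __init__(self, endpoint, versions):
        self.endpoint = endpoint
        self.target_version = versions.get(endpoint)


def is_stale(conn, versions):
    """True if the connection was opened with a version that was later corrected."""
    return conn.target_version != versions.get(conn.endpoint)


versions = VersionTable()
peer = "/127.0.0.2"

# Steps 1-3: the connection is opened before any message from the peer arrives.
conn = OutboundConnection(peer, versions)

# Later: the first message from the peer arrives and the version is corrected.
versions.set(peer, 5)

print(conn.target_version, versions.get(peer), is_stale(conn, versions))
```

A connection opened after the correction would pick up version 5 and not be flagged as stale, which matches the observation that only the early outbound connections are affected.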



--
This message was sent by Atlassian JIRA
(v6.1#6144)
