[ https://issues.apache.org/jira/browse/CASSANDRA-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13794960#comment-13794960 ]
Sergio Bossa commented on CASSANDRA-5692:
-----------------------------------------

[~jjordan], do we have thread dumps from the timeout failures (taken prior to the timeout)? If they didn't involve the connect method, we're probably seeing a different race. Anyway, I'll have a look.

> Race condition in detecting version on a mixed 1.1/1.2 cluster
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-5692
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5692
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.1.9, 1.2.5
>            Reporter: Sergio Bossa
>            Assignee: Sergio Bossa
>            Priority: Minor
>             Fix For: 1.2.7, 2.0 beta 1
>
>         Attachments: 5692-0005.patch, 5692-0006.patch
>
>
> On a mixed 1.1 / 1.2 cluster, starting 1.2 nodes sometimes triggers a race
> condition in version detection, where the 1.2 node wrongly detects version 6
> for a 1.1 node.
> It works as follows:
> 1) The just-started 1.2 node quickly opens an OutboundTcpConnection toward a
> 1.1 node before receiving any messages from the latter.
> 2) Because the version is correctly detected only once the first message is
> received, the version is momentarily set to 6.
> 3) This opens an OutboundTcpConnection from 1.2 to 1.1 at version 6, which
> gets stuck in the connect() method.
> Later the version is corrected, but by then all outbound connections from 1.2
> to 1.1 are stuck.
> Evidence from 1.2 logs:
> TRACE 13:48:31,133 Assuming current protocol version for /127.0.0.2
> DEBUG 13:48:37,837 Setting version 5 for /127.0.0.2

--
This message was sent by Atlassian JIRA
(v6.1#6144)
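
The sequence in steps 1) to 3) of the issue description can be sketched as a small, single-threaded Java model. This is only an illustration under stated assumptions, not Cassandra's code: VersionRaceSketch, OutboundConnection, getVersion, isStale and the versions map are all hypothetical names, and the only values taken from the report are versions 5 and 6 and the address 127.0.0.2. The model shows the stale version only; it does not reproduce the hang in connect().

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical model of the race; none of these names come from Cassandra.
class VersionRaceSketch {
    static final int VERSION_11 = 5; // version eventually logged for the 1.1 node
    static final int VERSION_12 = 6; // the 1.2 node's own ("current") version

    // Versions learned from peers; an entry exists only after a first message arrives.
    static final ConcurrentMap<InetAddress, Integer> versions = new ConcurrentHashMap<>();

    // Assumes the current version when nothing has been received from the peer yet.
    static int getVersion(InetAddress endpoint) {
        Integer known = versions.get(endpoint);
        return known == null ? VERSION_12 : known;
    }

    // Stand-in for an outbound connection that fixes its version at creation time.
    static final class OutboundConnection {
        final InetAddress endpoint;
        final int targetVersion;

        OutboundConnection(InetAddress endpoint) {
            this.endpoint = endpoint;
            this.targetVersion = getVersion(endpoint);
        }

        // True when the version captured at creation no longer matches the detected one.
        boolean isStale() {
            return targetVersion != getVersion(endpoint);
        }
    }

    public static void main(String[] args) throws UnknownHostException {
        InetAddress peer = InetAddress.getByAddress(new byte[] {127, 0, 0, 2});

        // Steps 1 and 2: the connection is created before any message from the peer.
        OutboundConnection conn = new OutboundConnection(peer);

        // Later: the first message arrives and the real version is recorded.
        versions.put(peer, VERSION_11);

        System.out.println("connection version = " + conn.targetVersion
                + ", detected version = " + getVersion(peer)
                + ", stale = " + conn.isStale());
    }
}
```

In this model the connection keeps the version it assumed at creation even after the map has been corrected, which matches the gap between the "Assuming current protocol version" and "Setting version 5" log lines above. A check along the lines of isStale() is one possible way to notice the mismatch; whether the attached patches take that approach is not stated here.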