[jira] [Commented] (CASSANDRA-5669) ITC.close() resets peer msg version, causes connection thrashing in ec2 during upgrade

Jonathan Ellis (JIRA) Thu, 20 Jun 2013 09:25:58 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13689376#comment-13689376
 ]


Jonathan Ellis commented on CASSANDRA-5669:
-------------------------------------------

bq. It looks like we'll (re)set the version on any new connection from a given 
node, so I'm not sure we need to explicitly throw away the version on close()

Here's the scenario.  A is 1.2.  B is 1.1.

B is restarted for upgrade.  A reconnects to B before B connects to A -- maybe 
it had an "undroppable" command to retry, or maybe it's just luck of the draw 
that A gossips or sends a command to B.

If we don't reset the version on close, A will connect to B as 1.1, and then B 
will think, "Oh, A is a 1.1 node, I'd better connect to him that way too."
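The scenario can be sketched as a toy model. This is an illustration only, not Cassandra's actual messaging code: the Node and UpgradeScenario classes, their methods, and the integer versions 11 and 12 (standing in for 1.1 and 1.2) are all hypothetical names invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of per-peer version tracking (not Cassandra's API).
class Node {
    final String name;
    final int ownVersion; // placeholder ints: 11 stands for 1.1, 12 for 1.2
    final Map<String, Integer> peerVersions = new HashMap<>();

    Node(String name, int ownVersion) {
        this.name = name;
        this.ownVersion = ownVersion;
    }

    // Version to speak when opening a connection: the remembered peer
    // version if it is lower than ours, otherwise our own version.
    int versionFor(String peer) {
        return Math.min(ownVersion, peerVersions.getOrDefault(peer, ownVersion));
    }

    // Inbound connection: remember the version the peer announced.
    void onInbound(String peer, int announcedVersion) {
        peerVersions.put(peer, announcedVersion);
    }

    // Connection closed: optionally forget what we knew about the peer.
    void onClose(String peer, boolean resetOnClose) {
        if (resetOnClose) {
            peerVersions.remove(peer);
        }
    }
}

class UpgradeScenario {
    // Returns the version B ends up using for A after B restarts upgraded
    // and A happens to reconnect first.
    static int versionBUsesForA(boolean resetOnClose) {
        Node a = new Node("A", 12);
        a.peerVersions.put("B", 11);        // learned while B still ran 1.1
        a.onClose("B", resetOnClose);       // B goes down for its upgrade
        Node b = new Node("B", 12);         // B comes back as 1.2, knowing nothing
        b.onInbound("A", a.versionFor("B")); // A reconnects before B does
        return b.versionFor("A");
    }
}
```

In this model, keeping the stale version on close leaves two 1.2 nodes talking to each other at 1.1, while resetting it lets A announce its real version.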

bq. The problem I'm trying to solve here is the upgraded node trying to contact 
the older node, and things getting wonky (data race) when the 
Ec2MultiRegionSnitch chooses to close the publicIP connection in favor of the 
localIP

So damned if you do, damned if you don't...

What if we add logic to Ec2MultiRegionSnitch (EC2MRS) to only reconnect if 
we're both on the current version?  1.1 -> 1.2 would reconnect then (because 
1.2 drops down to 1.1 after initial negotiation) but that's okay since it 
would reconnect at 1.1 again.  1.2 -> 1.1 would not reconnect, so you'd have 
extra public traffic until everyone upgrades.  Acceptable?
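A minimal sketch of that check, assuming the snitch can see both its own current messaging version and the version negotiated with the peer. ReconnectPolicy and shouldReconnectOnPrivateIp are hypothetical names, and 11 and 12 are placeholder ints for 1.1 and 1.2; none of this is the actual Ec2MultiRegionSnitch code.

```java
// Hypothetical guard for the proposed snitch logic (not Cassandra's API).
class ReconnectPolicy {
    // Switch from the public to the private IP only when the version in use
    // with the peer equals our own current version. A 1.1 node talking to a
    // 1.2 peer sees the negotiated version 1.1, so it still reconnects; a
    // 1.2 node talking to a 1.1 peer does not.
    static boolean shouldReconnectOnPrivateIp(int localCurrentVersion,
                                              int negotiatedPeerVersion) {
        return negotiatedPeerVersion == localCurrentVersion;
    }
}
```

With this guard a 1.2 node keeps its public-IP connection to any 1.1 peer, which is the extra public traffic mentioned above, and never hits the close-and-renegotiate path until the peer has been upgraded.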
                
> ITC.close() resets peer msg version, causes connection thrashing in ec2 
> during upgrade
> --------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-5669
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5669
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.2.5
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>            Priority: Minor
>              Labels: gossip
>             Fix For: 1.2.6, 2.0 beta 1
>
>         Attachments: 5669-v1.diff
>
>
> While debugging the upgrade scenario described in CASSANDRA-5660, I 
> discovered that ITC.close() resets the message protocol version of a peer 
> node that disconnects. CASSANDRA-5660 has a full description of the upgrade 
> path, but basically the Ec2MultiRegionSnitch closes connections on the 
> publicIP addr to reconnect on the privateIP, and this causes ITC to drop the 
> message protocol version of previously known nodes. I think we want to hang 
> onto that version so that when the newer node (re-)connects to the older 
> node, it passes the correct protocol version rather than the current 
> version (too high for the older node). Otherwise the connection attempt 
> gets dropped and we go through the dance again.
> To clarify, the 'thrashing' is at a rather low volume, from what I observed. 
> Anecdotally, perhaps one connection per second gets turned over.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
