[ 
https://issues.apache.org/jira/browse/CASSANDRA-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103667#comment-16103667
 ] 

Jason Brown commented on CASSANDRA-13700:
-----------------------------------------

[~jkni] In our internal review of this change, we discovered that that patch 
can be made incrementally safer. In 
{{Gossiper#getStateForVersionBiggerThan()}}, we [get the 
generation|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/gms/Gossiper.java#L899],
 but it's possible the {{epState.getHeartBeatState()}} could have been swapped 
out before the next line executes (where we get the version).

Are you ok if I make the following small change to 
{{Gossiper#getStateForVersionBiggerThan()}}:
{code}
-            int localHbGeneration = 
epState.getHeartBeatState().getGeneration();
-            int localHbVersion = 
epState.getHeartBeatState().getHeartBeatVersion();
+            HeartBeatState heartBeatState = epState.getHeartBeatState();
+            int localHbGeneration = heartBeatState.getGeneration();
+            int localHbVersion = heartBeatState.getHeartBeatVersion();
{code}

Basically, just grab a reference the the same `HeartBeatState` that we'll use 
for both the generation and version. Of course, this does not protect against 
that `HeartBeatState` instance being mutated, but at least we can be smarter 
about what we reference from the `epState`.

> Heartbeats can cause gossip information to go permanently missing on certain 
> nodes
> ----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13700
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13700
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Distributed Metadata
>            Reporter: Joel Knighton
>            Assignee: Joel Knighton
>            Priority: Critical
>             Fix For: 2.2.11, 3.0.15, 3.11.1, 4.0, 2.1.19
>
>
> In {{Gossiper.getStateForVersionBiggerThan}}, we add the {{HeartBeatState}} 
> from the corresponding {{EndpointState}} to the {{EndpointState}} to send. 
> When we're getting state for ourselves, this means that we add a reference to 
> the local {{HeartBeatState}}. Then, once we've built a message (in either the 
> Syn or Ack handler), we send it through the {{MessagingService}}. In the case 
> that the {{MessagingService}} is sufficiently slow, the {{GossipTask}} may 
> run before serialization of the Syn or Ack. This means that when the 
> {{GossipTask}} acquires the gossip {{taskLock}}, it may increment the 
> {{HeartBeatState}} version of the local node as stored in the endpoint state 
> map. Then, when we finally serialize the Syn or Ack, we'll follow the 
> reference to the {{HeartBeatState}} and serialize it with a higher version 
> than we saw when constructing the Ack or Ack2.
> Consider the case where we see {{HeartBeatState}} with version 4 when 
> constructing an Ack and send it through the {{MessagingService}}. Then, we 
> add some piece of state with version 5 to our local {{EndpointState}}. If 
> {{GossipTask}} runs and increases the {{HeartBeatState}} version to 6 before 
> the {{MessageOut}} containing the Ack is serialized, the node receiving the 
> Ack will believe it is current to version 6, despite the fact that it has 
> never received a message containing the {{ApplicationState}} tagged with 
> version 5.
> I've reproduced in this in several versions; so far, I believe this is 
> possible in all versions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to