[
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joel Knighton updated CASSANDRA-12653:
--------------------------------------
Status: Awaiting Feedback (was: Open)
A few questions/comments on the latest patches:
* On all versions, a space is missing in the conditional {{if(firstSynSendAt ==
0)}}
* On 2.2, it looks like the patch adds the {{valuesEqual}} method from later
versions. Was this intentional? It looks unused.
* On all versions, using {{firstSynSendAt == 0}} to check if it has been
initialized isn't entirely safe. It's entirely legal (although admittedly rare)
for {{System.nanoTime}} to return 0. If this happened, all acks would be
rejected.
* Comparisons for two {{System.nanoTime}} values (such as in
{{GossipDigestAckVerbHandler}}) should not use t1 < t2. Instead, one should
check the difference (t1 - t2 < 0) because numerical overflow could occur in
the {{System.nanoTime}} long.
* In {{maybeFinishShadowRound}}/{{finishShadowRound}}, we should add the states
to the {{endpointShadowStateMap}} before setting {{inShadowRound}} to false. It
looks like the current behavior admits a race where {{doShadowRound}} could
read {{inShadowRound == false}} and exit its loop and copy the
endpointShadowStateMap before it is filled by shadow round finish.
* I believe {{firstSynSendAt}} is accessed from multiple threads and needs to
be volatile.
> In-flight shadow round requests
> -------------------------------
>
> Key: CASSANDRA-12653
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
> Project: Cassandra
> Issue Type: Bug
> Components: Distributed Metadata
> Reporter: Stefan Podkowinski
> Assignee: Stefan Podkowinski
> Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
> Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch
>
>
> Bootstrapping or replacing a node in the cluster requires to gather and check
> some host IDs or tokens by doing a gossip "shadow round" once before joining
> the cluster. This is done by sending a gossip SYN to all seeds until we
> receive a response with the cluster state, from where we can move on in the
> bootstrap process. Receiving a response will call the shadow round done and
> calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state
> again.
> The issue here is that at this point there might be other in-flight requests
> and it's very likely that shadow round responses from other seeds will be
> received afterwards, while the current state of the bootstrap process doesn't
> expect this to happen (e.g. gossiper may or may not be enabled).
> One side effect will be that MigrationTasks are spawned for each shadow round
> reply except the first. Tasks might or might not execute based on whether at
> execution time {{Gossiper.resetEndpointStateMap}} had been called, which
> effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at
> start of the task. You'll see error log messages such as follows when this
> happend:
> {noformat}
> INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 -
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1] 2016-09-08 08:36:39,255 FailureDetector.java:223
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although is isn't pretty, I currently don't see any serious harm from this,
> but it would be good to get a second opinion (feel free to close as "wont
> fix").
> /cc [~Stefania] [~thobbs]
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)