[
https://issues.apache.org/jira/browse/ZOOKEEPER-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291487#comment-17291487
]
Mate Szalay-Beko commented on ZOOKEEPER-4220:
---------------------------------------------
Thanks for the clarification [~mirg]!
In this case this fix will likely improve the situation. Cool! :)
> Redundant connection attempts during leader election if quorum members changed
> ------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-4220
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4220
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.5.9, 3.6.2
> Reporter: Alex Mirgorodskiy
> Assignee: Mate Szalay-Beko
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.10, 3.6.3, 3.7.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We've seen a few failures or long delays in electing a new leader when the
> previous one has a hard host reset (as opposed to just the service process
> down, since connections don't need to wait for timeout there). Symptoms are
> similar to https://issues.apache.org/jira/browse/ZOOKEEPER-2164. Reducing
> cnxTimeout from 5 to 1.5 seconds makes the problem much less frequent, but
> doesn't fix it completely. We are still using an old ZooKeeper version
> (3.5.5), and the new async connect feature will presumably avoid it.
> But we noticed a pattern of twice the expected number of connection attempts
> to the same downed instance in the log, and it appears to be due to a code
> glitch in QuorumCnxManager.java:
>
> {code:java}
> synchronized void connectOne(long sid) {
> ...
> if (lastCommittedView.containsKey(sid)) {
> knownId = true;
> if (connectOne(sid, lastCommittedView.get(sid).electionAddr))
> return;
> }
> if (lastSeenQV != null && lastProposedView.containsKey(sid)
> && (!knownId || (lastProposedView.get(sid).electionAddr != <----
> lastCommittedView.get(sid).electionAddr))) {
> knownId = true;
> if (connectOne(sid, lastProposedView.get(sid).electionAddr))
> return;
> }
> {code}
> Comparing electionAddrs should be done with !equals presumably, otherwise
> connectOne will be invoked an extra time even in the common case when the
> addresses do match.
> The code around it has changed recently, but the check itself still exists at
> the top of master. It might not matter as much with the async connects, but
> perhaps it helps even then.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)