[
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032546#comment-17032546
]
Suhas Dantkale edited comment on ZOOKEEPER-2164 at 2/7/20 5:42 PM:
-------------------------------------------------------------------
Hi [~symat],
Does your zoo.cfg use 0.0.0.0 as the server's own hostname, i.e. in the entry
that represents the server itself in the ensemble's server list?
I don't know if you are running into this issue, but here is my analysis of a
problem I faced with 0.0.0.0:
h3. How does IP address 0.0.0.0 create a problem for the ZK Leader Election protocol?
(1) When a ZK server starts for the very first time, it sends out its _Initial
Identification_ to its peers. The zoo.cfg file has its peers’ hostnames and
election ports. In its _Initial Identification_, represented by _class
InitialMessage_ in the code, the ZK server sends out its sid and 0.0.0.0 as its
identification.
(2) When the peers receive and parse this Initial message, they understand it
as coming from 0.0.0.0. The _handleConnection_ method receives and parses the
message coming from the peer, and saves the peer's identification into
_electionAddr_ and _sid_.
*The peer uses this identification to connect back to the ZK server on 0.0.0.0,
if its own SID is greater than the SID of the originating node.* Before it
establishes the new connection, it closes the existing connection to the
originating node, if there is one.
I believe it does that to increase the chances of the node with the higher SID
becoming the leader.
*However, 0.0.0.0:3888 is not the originating node's address; it is actually
the peer's own address :)* So the peer server tries to connect to itself,
thinking that it’s connecting to the originating node.
What this leads to is that the originating node is never able to complete its
Leader Election and therefore is never able to join the cluster.
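The connect-back behaviour described in (2) can be sketched as a small toy
model. This is not ZooKeeper's actual code: the class name, the _dialTarget_
method, and the example hostnames are all hypothetical, and the sketch simply
assumes, as in the analysis above, that dialing 0.0.0.0 reaches the local
machine.

```java
import java.util.Map;

/** Toy model of the connect-back decision; not ZooKeeper's actual code. */
public class ConnectBackSketch {

    /** Wildcard addresses mean "any local interface", never a remote host. */
    static boolean isWildcard(String host) {
        return host.equals("0.0.0.0") || host.equals("::");
    }

    /**
     * Returns the SID of the server the receiver ends up dialing after it
     * gets an initial message carrying (senderSid, advertisedHost), or -1 if
     * it keeps the incoming connection or cannot resolve the host.
     */
    static long dialTarget(long mySid, long senderSid, String advertisedHost,
                           Map<String, Long> hostToSid) {
        if (mySid <= senderSid) {
            return -1; // receiver does not have the higher SID: no connect-back
        }
        // Higher SID: drop the incoming connection and dial the advertised host.
        if (isWildcard(advertisedHost)) {
            return mySid; // assumption: 0.0.0.0 resolves to the receiver itself
        }
        return hostToSid.getOrDefault(advertisedHost, -1L);
    }

    public static void main(String[] args) {
        // Hypothetical hostnames for servers 1 and 3.
        Map<String, Long> hosts =
                Map.of("zk1.example.com", 1L, "zk3.example.com", 3L);
        System.out.println(dialTarget(3, 1, "0.0.0.0", hosts));         // 3
        System.out.println(dialTarget(3, 1, "zk1.example.com", hosts)); // 1
        System.out.println(dialTarget(1, 3, "0.0.0.0", hosts));         // -1
    }
}
```

In the first call, server 3 receives 0.0.0.0 from server 1 and ends up dialing
itself, which is the failure described above. In the third call the receiver
has the lower SID, so no connect-back happens and the problem does not show up.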
If the nodes join the cluster in decreasing order of SID, this problem doesn't
happen, for the reasons described above.
We temporarily solved this problem by not using 0.0.0.0 as the server's own IP
address, because in our setup we don't have to deal with multiple network
interfaces.
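As an illustration of that workaround, here is a hypothetical server list from
the zoo.cfg on server 1. The hostnames are made up, and 2888/3888 are only the
commonly used quorum and election ports; adjust them to your setup.

```
# Problematic: server 1 advertises itself as 0.0.0.0
server.1=0.0.0.0:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

# Workaround: server 1 uses its real, routable hostname instead
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```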
However, I was planning to propose the following change to the community:
In point (2) above, when connecting back to the originating node, can the peer
look into its own zoo.cfg to get the real hostname or IP address of the
originating node, instead of relying on the IP address that comes in the
Initial message? Since the Initial message also carries the SID, the peer can
easily get the real identity of the originating node from its zoo.cfg file.
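A minimal sketch of that proposal, assuming the peer keeps a map from SID to
the host configured in its own zoo.cfg. The class, the method, and the
_configuredHosts_ map are hypothetical names for illustration, not existing
ZooKeeper APIs.

```java
import java.util.Map;

/** Sketch of the proposed lookup; names are illustrative, not ZooKeeper APIs. */
public class PeerAddressLookupSketch {

    /**
     * Picks the host to connect back to: prefer the address configured for
     * the sender's SID in the local zoo.cfg, and fall back to the address
     * advertised in the Initial message only if the SID is unknown locally.
     */
    static String resolveConnectBackHost(long senderSid, String advertisedHost,
                                         Map<Long, String> configuredHosts) {
        String configured = configuredHosts.get(senderSid);
        return configured != null ? configured : advertisedHost;
    }

    public static void main(String[] args) {
        // Hypothetical view of the ensemble from server 3's zoo.cfg.
        Map<Long, String> configuredHosts =
                Map.of(1L, "zk1.example.com", 2L, "zk2.example.com");
        // Server 1 advertised 0.0.0.0, but its SID maps to a real hostname.
        System.out.println(resolveConnectBackHost(1L, "0.0.0.0", configuredHosts));
    }
}
```

With this lookup, the advertised 0.0.0.0 is ignored whenever the sender's SID
is present in the local configuration, so the peer no longer dials itself.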
I don't know if the above points help you, but I thought I would share them
with you.
> fast leader election keeps failing
> ----------------------------------
>
> Key: ZOOKEEPER-2164
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.4.5
> Reporter: Michi Mutsuzaki
> Assignee: Mate Szalay-Beko
> Priority: Major
> Fix For: 3.7.0, 3.5.8
>
>
> I have a 3-node cluster with sids 1, 2 and 3. Originally 2 is the leader.
> When I shut down 2, 1 and 3 keep going back to leader election. Here is what
> seems to be happening.
> - Both 1 and 3 elect 3 as the leader.
> - 1 receives votes from 3 and itself, and starts trying to connect to 3 as a
> follower.
> - 3 doesn't receive votes for 5 seconds because connectOne() to 2 doesn't
> timeout for 5 seconds:
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L346
> - By the time 3 receives votes, 1 has given up trying to connect to 3:
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L247
> I'm using 3.4.5, but it looks like this part of the code hasn't changed for a
> while, so I'm guessing later versions have the same issue.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)