[jira] [Comment Edited] (ZOOKEEPER-2164) fast leader election keeps failing

Suhas Dantkale (Jira) Fri, 07 Feb 2020 09:43:49 -0800


    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032546#comment-17032546
 ]


Suhas Dantkale edited comment on ZOOKEEPER-2164 at 2/7/20 5:42 PM:
-------------------------------------------------------------------

Hi [~symat], 

Does your zoo.cfg use 0.0.0.0 as the server's own hostname (the entry that 
represents itself) in the ensemble's server list?

I don't know whether you are running into this issue, but here is my analysis 
of the problem I faced with 0.0.0.0:
h3. How does IP address 0.0.0.0 create a problem with the ZK Leader Election protocol?

(1) When a ZK server starts for the very first time, it sends its _Initial 
Identification_ to its peers. The zoo.cfg file contains its peers’ hostnames and 
election port information. In this _Initial Identification_, represented by _class 
InitialMessage_ in the code, the ZK server sends its sid and 0.0.0.0 as its 
identification.

(2) When the peers receive and parse this initial message, they understand it 
as coming from 0.0.0.0. The _handleConnection_ method receives and parses the 
message coming from its peer and saves the peer's identification into 
_electionAddr_ and _sid_.

*The peer uses this identification to connect back to the ZK server on 0.0.0.0 
if its own SID is greater than the SID of the originating node.* Before it 
establishes the new connection, it closes any existing connection to the 
originating node.

I believe it does that to increase the chances of the node with the higher SID 
becoming the leader.

*However, 0.0.0.0:3888 is not the peer's address; it is actually the server's own 
address :)* So the peer server tries to connect to itself, thinking that it is 
connecting to the originating node.

The result is that the originating node is never able to complete its 
leader election and therefore is never able to join the cluster.
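
To make the connect-back step in (2) concrete, here is a minimal Python sketch. It is a toy model of the behaviour described above, not ZooKeeper code: the names `InitialMessage`, `connect_back_target` and `dials_itself` are made up for illustration, and the SID comparison rule is taken from the description in this comment.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

WILDCARD = "0.0.0.0"


@dataclass(frozen=True)
class InitialMessage:
    """Toy stand-in for the initial identification: sender SID plus advertised election address."""
    sid: int
    host: str
    port: int


def connect_back_target(local_sid: int, msg: InitialMessage) -> Optional[Tuple[str, int]]:
    """Return the address the receiver would dial back, or None if it keeps the inbound connection.

    Per the analysis above, the receiver connects back only when its own SID
    is greater than the sender's SID, and it uses the advertised address.
    """
    if local_sid > msg.sid:
        return (msg.host, msg.port)
    return None


def dials_itself(target: Optional[Tuple[str, int]]) -> bool:
    """True when the dial-back address is the wildcard, which the analysis says lands on the dialing host."""
    return target is not None and target[0] == WILDCARD


# Node 1 advertises 0.0.0.0:3888; node 3 (higher SID) decides to connect back.
target = connect_back_target(local_sid=3, msg=InitialMessage(sid=1, host=WILDCARD, port=3888))
print(target, dials_itself(target))
```

In this model node 3 ends up dialing 0.0.0.0:3888, i.e. its own election port, which is the failure described above. A receiver with the lower SID returns `None` and keeps the inbound connection.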

If you start the nodes so that they join the cluster in decreasing order of 
SIDs, this problem doesn't happen, for the reasons described above.

 

We temporarily solved this problem by not using 0.0.0.0 as the server's own IP 
address, because in our setup we don't have to deal with multiple network 
interfaces.

 

However, I was planning to propose the following change to the community:

In point (2) above, when connecting back to the originating node, could the peer 
look into its own zoo.cfg to get the real hostname or IP address of the 
originating node, instead of relying on the IP address that arrives in the initial 
message? Since the initial message also carries the SID, the peer can easily 
get the real identity of the originating node from its zoo.cfg file.
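
A rough Python sketch of that proposal, for illustration only. It assumes the usual `server.N=host:quorumPort:electionPort` lines in zoo.cfg; the helper names `parse_election_addrs` and `connect_back_addr` and the `zk1.example.com` style hostnames are hypothetical, and the simple regex does not handle bracketed IPv6 hosts.

```python
import re
from typing import Dict, Tuple

# Matches the start of lines such as "server.1=zk1.example.com:2888:3888";
# anything after the election port (role, client port) is ignored.
SERVER_LINE = re.compile(r"^server\.(\d+)=([^:]+):(\d+):(\d+)")


def parse_election_addrs(cfg_text: str) -> Dict[int, Tuple[str, int]]:
    """Map each SID in the config text to its (host, election_port)."""
    addrs: Dict[int, Tuple[str, int]] = {}
    for line in cfg_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            sid, host, _quorum_port, election_port = match.groups()
            addrs[int(sid)] = (host, int(election_port))
    return addrs


def connect_back_addr(local_cfg: str, remote_sid: int,
                      advertised: Tuple[str, int]) -> Tuple[str, int]:
    """Prefer the address the local zoo.cfg lists for remote_sid; fall back to the advertised one."""
    return parse_election_addrs(local_cfg).get(remote_sid, advertised)


# Hypothetical zoo.cfg of node 3, which lists itself as 0.0.0.0.
NODE3_CFG = """
tickTime=2000
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=0.0.0.0:2888:3888
"""

# Node 1 advertised 0.0.0.0:3888, but node 3 looks the SID up locally instead.
print(connect_back_addr(NODE3_CFG, 1, ("0.0.0.0", 3888)))
```

With this lookup, node 3 would dial the host it has configured for SID 1 rather than the wildcard address carried in the initial message. Falling back to the advertised address for an unknown SID is a design choice of this sketch, not something stated in the proposal.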

 

I don't know whether the above points help you, but I thought I'd share them 
with you.

 



> fast leader election keeps failing
> ----------------------------------
>
>                 Key: ZOOKEEPER-2164
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.5
>            Reporter: Michi Mutsuzaki
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>             Fix For: 3.7.0, 3.5.8
>
>
> I have a 3-node cluster with sids 1, 2 and 3. Originally 2 is the leader. 
> When I shut down 2, 1 and 3 keep going back to leader election. Here is what 
> seems to be happening.
> - Both 1 and 3 elect 3 as the leader.
> - 1 receives votes from 3 and itself, and starts trying to connect to 3 as a 
> follower.
> - 3 doesn't receive votes for 5 seconds because connectOne() to 2 doesn't 
> timeout for 5 seconds: 
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L346
> - By the time 3 receives votes, 1 has given up trying to connect to 3: 
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L247
> I'm using 3.4.5, but it looks like this part of the code hasn't changed for a 
> while, so I'm guessing later versions have the same issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
