[
https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dai Shi updated ZOOKEEPER-3756:
-------------------------------
Description:
Not sure if this is the right place to ask; please close if it's not.
Since upgrading to 3.5, I am seeing some behavior that I can't explain.
In a 5-member quorum, server 3 is the leader and each server has this in
its configuration:
{code:java}
server.1=100.71.255.254:2888:3888:participant;2181
server.2=100.71.255.253:2888:3888:participant;2181
server.3=100.71.255.252:2888:3888:participant;2181
server.4=100.71.255.251:2888:3888:participant;2181
server.5=100.71.255.250:2888:3888:participant;2181{code}
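For readers less familiar with the 3.5 server line format, here is a minimal sketch that splits each entry above into its fields. It assumes the layout server.<id>=<host>:<quorum port>:<election port>[:<role>];<client port> and does not handle the optional client address before the client port. The helper name parse_server_line is hypothetical and is not part of ZooKeeper.

```python
import re

# Assumed layout of a ZooKeeper 3.5 server line (client address form not handled):
#   server.<id>=<host>:<quorum port>:<election port>[:<role>];<client port>
SERVER_LINE = re.compile(
    r"^server\.(?P<sid>\d+)="
    r"(?P<host>[^:;]+):(?P<quorum_port>\d+):(?P<election_port>\d+)"
    r"(?::(?P<role>participant|observer))?"
    r";(?P<client_port>\d+)$"
)


def parse_server_line(line):
    """Split one server.N entry into named fields (hypothetical helper)."""
    match = SERVER_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized server line: {line!r}")
    fields = match.groupdict()
    for key in ("sid", "quorum_port", "election_port", "client_port"):
        fields[key] = int(fields[key])
    # A server line without an explicit role is treated as a participant.
    fields["role"] = fields["role"] or "participant"
    return fields


CONFIG = """\
server.1=100.71.255.254:2888:3888:participant;2181
server.2=100.71.255.253:2888:3888:participant;2181
server.3=100.71.255.252:2888:3888:participant;2181
server.4=100.71.255.251:2888:3888:participant;2181
server.5=100.71.255.250:2888:3888:participant;2181
"""

servers = [parse_server_line(line) for line in CONFIG.splitlines()]
for server in servers:
    print(server["sid"], server["host"], server["election_port"], server["role"])
```

Here 2888 is the quorum port, 3888 is the leader election port that appears in the log output, and 2181 is the client port.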
If servers 1 or 2 are restarted, they fail to rejoin the quorum, with this in
the logs:
{code:java}
2020-03-11 20:23:35,720 [myid:2] - INFO [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - LOOKING
2020-03-11 20:23:35,721 [myid:2] - INFO [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885] - New election. My id = 2, proposed zxid=0x1b8005f4bba
2020-03-11 20:23:35,733 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (3, 2)
2020-03-11 20:23:35,734 [myid:2] - INFO [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection request 100.126.116.201:36140
2020-03-11 20:23:35,735 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (4, 2)
2020-03-11 20:23:35,740 [myid:2] - INFO [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (5, 2)
2020-03-11 20:23:35,740 [myid:2] - INFO [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection request 100.126.116.201:36142
2020-03-11 20:23:35,740 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2020-03-11 20:23:35,742 [myid:2] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
2020-03-11 20:23:35,744 [myid:2] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread id 3 my id = 2
2020-03-11 20:23:35,745 [myid:2] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting SendWorker{code}
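To help read the "Have smaller server identifier" lines: as far as I understand the election connection handling in 3.5, when a server opens an election connection to a peer with a larger id, it closes that connection and waits for the peer to connect back, so only connections opened by the larger id are kept. The sketch below only illustrates that rule under this assumption. It is not ZooKeeper's code, and both function names are hypothetical.

```python
def keeps_outgoing_connection(my_sid, remote_sid):
    """Return True if a connection opened by my_sid to remote_sid is kept.

    Assumed rule: only the server with the larger id keeps the election
    connections that it initiates.
    """
    return my_sid > remote_sid


def outgoing_connection_outcomes(my_sid, all_sids):
    """Map each peer id to what happens to a connection opened by my_sid."""
    outcomes = {}
    for remote_sid in sorted(all_sids):
        if remote_sid == my_sid:
            continue  # a server does not connect to itself
        if keeps_outgoing_connection(my_sid, remote_sid):
            outcomes[remote_sid] = "kept"
        else:
            outcomes[remote_sid] = "dropped, wait for peer to connect back"
    return outcomes


# Server 2 in the 5-member ensemble from the configuration above.
for peer, outcome in outgoing_connection_outcomes(2, [1, 2, 3, 4, 5]).items():
    print(f"connection (2 -> {peer}): {outcome}")
```

Under this reading, the dropped pairs (3, 2), (4, 2), and (5, 2) in the log are expected on their own, and server 2 then depends on servers 3, 4, and 5 connecting back to its election port.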
The only way I have found to get them to rejoin the quorum is to restart the
leader.
However, if I remove servers 4 and 5 from the configuration of server 1 or 2 (so
only servers 1, 2, and 3 remain in the configuration file), then they can
rejoin the quorum fine. Is this expected, or am I doing something wrong? Any
help or explanation would be greatly appreciated. Thank you.
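To make the numbers concrete, here is a small sketch of the majority size for the two configurations mentioned, assuming plain majority quorums with no weights or groups configured. It does not explain the behavior; it only shows how many votes each view of the ensemble would require. The function name majority is hypothetical.

```python
def majority(voters):
    """Smallest number of votes that is more than half of the voters."""
    return voters // 2 + 1


full_view = majority(5)     # all five servers listed in the configuration
reduced_view = majority(3)  # only servers 1, 2, and 3 listed
print(f"5 voters need {full_view} votes; 3 voters need {reduced_view} votes")
```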
> Members failing to rejoin quorum
> --------------------------------
>
> Key: ZOOKEEPER-3756
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3756
> Project: ZooKeeper
> Issue Type: Improvement
> Components: leaderElection
> Affects Versions: 3.5.6, 3.5.7
> Reporter: Dai Shi
> Priority: Major
>