[
https://issues.apache.org/jira/browse/ZOOKEEPER-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753063#comment-17753063
]
Coran Stow commented on ZOOKEEPER-2938:
---------------------------------------
I hit this issue on a client site where quorum formation was hit and miss, so I
had a look through the code.
The peers try to ensure that there is only one connection between any pair of
servers in the peer group, and they do this by dropping the connection when the
server that initiates it finds that the server it has connected to has a higher
zkid. If the initiating server finds that it has the higher zkid, it keeps the
connection. That's where we get:
Have smaller server identifier, so dropping the connection:
This is expected behaviour. The problem is actually that the server with the
higher zkid fails to sustain a connection to the server that is emitting this
message.
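As a rough sketch of that tie-break rule (the class and method names are made
up for illustration and are not taken from QuorumCnxManager):

```java
// Hypothetical sketch of the tie-break rule; not ZooKeeper's actual code.
final class ConnectionTieBreak {

    /** Returns true if the server that opened the connection keeps it. */
    static boolean initiatorKeepsConnection(long initiatorId, long remoteId) {
        // The initiator keeps the socket only when its own id is the higher one.
        // Otherwise it drops the socket and relies on the remote connecting back.
        return initiatorId > remoteId;
    }

    public static void main(String[] args) {
        // Server 1 connects to server 2: server 1 drops the connection.
        System.out.println(initiatorKeepsConnection(1, 2)); // false
        // Server 2 connects to server 1: server 2 keeps the connection.
        System.out.println(initiatorKeepsConnection(2, 1)); // true
    }
}
```

The {{(2, 1)}} in the log line corresponds to the first case: server 1 opened a
connection to server 2 and dropped it.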
As part of resolving the peer list for the quorum, each node scans the initial
configuration and if it notices that the ID of the server it reads from the
config is its own zkid, it stores its own hostname(s). It seems this was done
as part of https://issues.apache.org/jira/browse/ZOOKEEPER-107 to allow a
server to discover its own hostname(s) and communicate them to the other nodes
to allow for dynamic reconfiguration.
So, when the quorum manager initiates a connection to a peer, it sends its
hostnames to that peer.
([here|https://github.com/apache/zookeeper/blob/15f29b51a22bc51b9d6074cb7f3e72bb00a9753a/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L466])
When a quorum manager successfully _receives_ a connection from a peer and it
has a higher zkid than that peer, it _immediately_ initiates a connection back
to the sender, using the zkid and hostname(s) sent in the sender's initial
connection. If this all happens while the receiving server is starting up, it
interferes with that server's attempts to connect to other servers using its
initial configuration, because the quorum manager won't attempt to connect to a
server if it's already attempting to connect to a server with that same zkid.
([here|https://github.com/apache/zookeeper/blob/15f29b51a22bc51b9d6074cb7f3e72bb00a9753a/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L597])
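To illustrate why the connect-back can block the configured address, here is a
minimal model of a "one pending connect per zkid" guard. It is only a sketch
under my own assumptions: {{PendingConnects}}, {{tryStart}} and {{finish}} are
hypothetical names, not the real QuorumCnxManager API.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of a guard that allows one pending connect per server id.
final class PendingConnects {
    private final Set<Long> inProgress = ConcurrentHashMap.newKeySet();

    /** Returns true if a new attempt was started, false if one is already pending. */
    boolean tryStart(long sid) {
        return inProgress.add(sid);
    }

    /** Called when an attempt finishes, whether it succeeded or failed. */
    void finish(long sid) {
        inProgress.remove(sid);
    }

    public static void main(String[] args) {
        PendingConnects pending = new PendingConnects();
        // Connect-back to server 1 using the hostname that server 1 sent.
        System.out.println(pending.tryStart(1L)); // true
        // Attempt using the configured address for server 1 is skipped meanwhile.
        System.out.println(pending.tryStart(1L)); // false
        // Only once the first attempt ends can another one start.
        pending.finish(1L);
        System.out.println(pending.tryStart(1L)); // true
    }
}
```

If the first attempt targets a hostname that cannot be resolved or reached, the
attempt that would have used the working configured address never starts while
the first one is still pending.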
In an environment where the hostname that a ZooKeeper server discovers for
itself is different from the hostname other servers use to reach it, e.g. when
your ensemble spans multiple Kubernetes clusters, the hostname that a node
discovers for itself might not be resolvable by the other nodes.
If the node discovers its hostname is
{{zookeeper1.zookeeper.zookeeper-us-east-dc.svc.cluster.local}}, it can
pass that on to other nodes in the ensemble, which may then fail to connect
because they're in a different Kubernetes cluster.
Setting the hostname to {{0.0.0.0}} works because ZooKeeper can bind to
that IP address, but there is no hostname to pass on to other servers. Or more
technically, it passes its hostname as {{null}} to other servers, which then
handle the {{null}} address by using the address they already have in their
configuration.
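The fallback described above amounts to something like the following. Again,
this is a simplified sketch with hypothetical names ({{ElectionAddressChoice}},
{{chooseAddress}}) and plain strings instead of real socket addresses; the
hostname {{zk1.example.internal}} is a placeholder.

```java
// Hypothetical sketch of the null-address fallback; not ZooKeeper's actual code.
final class ElectionAddressChoice {

    /** Prefers the address the peer sent; falls back to the configured one. */
    static String chooseAddress(String advertised, String configured) {
        // A peer configured with 0.0.0.0 has nothing useful to advertise,
        // modelled here as null, so the locally configured address is used.
        return advertised != null ? advertised : configured;
    }

    public static void main(String[] args) {
        String configured = "zk1.example.internal:3888"; // placeholder address
        // Peer advertised a cluster-local name: that name is used.
        System.out.println(chooseAddress(
                "zookeeper1.zookeeper.zookeeper-us-east-dc.svc.cluster.local:3888",
                configured));
        // Peer advertised nothing: the configured address is used.
        System.out.println(chooseAddress(null, configured));
    }
}
```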
> Server is unable to join quorum after connection broken to other peers
> ----------------------------------------------------------------------
>
> Key: ZOOKEEPER-2938
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2938
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.6, 3.4.14
> Reporter: Abhay Bothra
> Priority: Major
>
> We see the following logs in the node with {{myid: 1}}
> {code}
> 2017-11-08 15:06:28,375 [myid:1] - INFO
> [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier,
> so dropping the connection: (2, 1)
> 2017-11-08 15:06:28,375 [myid:1] - INFO
> [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier,
> so dropping the connection: (3, 1)
> 2017-11-08 15:07:28,375 [myid:1] - INFO
> [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message
> format version), 1 (n.leader), 0x28e000a8750 (n.zxid), 0x1 (n.round), LOOKING
> (n.state), 1 (n.sid), 0x28e (n.peerEpoch) LOOKING (my state)
> 2017-11-08 15:07:28,375 [myid:1] - INFO
> [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier,
> so dropping the connection: (2, 1)
> 2017-11-08 15:07:28,376 [myid:1] - INFO
> [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier,
> so dropping the connection: (3, 1)
> 2017-11-08 15:08:28,375 [myid:1] - INFO
> [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message
> format version), 1 (n.leader), 0x28e000a8750 (n.zxid), 0x1 (n.round), LOOKING
> (n.state), 1 (n.sid), 0x28e (n.peerEpoch) LOOKING (my state)
> 2017-11-08 15:08:28,376 [myid:1] - INFO
> [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier,
> so dropping the connection: (2, 1)
> 2017-11-08 15:08:28,376 [myid:1] - INFO
> [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier,
> so dropping the connection: (3, 1)
> 2017-11-08 15:09:28,376 [myid:1] - INFO
> [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message
> format version), 1 (n.leader), 0x28e000a8750 (n.zxid), 0x1 (n.round), LOOKING
> (n.state), 1 (n.sid), 0x28e (n.peerEpoch) LOOKING (my state)
> 2017-11-08 15:09:28,376 [myid:1] - INFO
> [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier,
> so dropping the connection: (2, 1)
> 2017-11-08 15:09:28,376 [myid:1] - INFO
> [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier,
> so dropping the connection: (3, 1)
> 2017-11-08 15:10:28,376 [myid:1] - INFO
> [WorkerReceiver[myid=1]:FastLeaderElection@597] - Notification: 1 (message
> format version), 1 (n.leader), 0x28e000a8750 (n.zxid), 0x1 (n.round), LOOKING
> (n.state), 1 (n.sid), 0x28e (n.peerEpoch) LOOKING (my state)
> 2017-11-08 15:10:28,376 [myid:1] - INFO
> [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier,
> so dropping the connection: (2, 1)
> 2017-11-08 15:10:28,377 [myid:1] - INFO
> [WorkerSender[myid=1]:QuorumCnxManager@193] - Have smaller server identifier,
> so dropping the connection: (3, 1)
> {code}
> On the nodes with {{myid: 2}} and {{myid: 3}}, we see connection broken
> events for {{myid: 1}}
> {code}
> 2017-11-07 02:54:32,135 [myid:2] - WARN
> [RecvWorker:1:QuorumCnxManager$RecvWorker@780] - Connection broken for id 1,
> my id = 2, error =
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:209)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.net.SocketInputStream.read(SocketInputStream.java:223)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
> 2017-11-07 02:54:32,135 [myid:2] - WARN
> [RecvWorker:1:QuorumCnxManager$RecvWorker@783] - Interrupting SendWorker
> 2017-11-07 02:54:32,135 [myid:2] - WARN
> [SendWorker:1:QuorumCnxManager$SendWorker@697] - Interrupted while waiting
> for message on queue
> java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> at
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:849)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:64)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:685)
> 2017-11-07 02:54:32,135 [myid:2] - WARN
> [SendWorker:1:QuorumCnxManager$SendWorker@706] - Send worker leaving thread
> {code}
> From the reported occurrences, it looks like this is a problem only when the
> node with the smallest {{myid}} loses connection.