[
https://issues.apache.org/jira/browse/ZOOKEEPER-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850561#comment-17850561
]
gendong1 commented on ZOOKEEPER-3923:
-------------------------------------
Here is similar syptom, but different root causes. [ZOOKEEPER-4816] A follower
can not join the cluster for 20s seconds - ASF JIRA (apache.org)
> Leader election issues with Istio
> ---------------------------------
>
> Key: ZOOKEEPER-3923
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3923
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.6.1, 3.5.8
> Reporter: Matteo Merli
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> We have seen repeated occasion in which restarting the ZK leader node can
> lead to ~2min of quorum unavailability. This is easily reproducible on a
> cluster when leader election happens through a TCP proxying layer (such as
> Istio) in a Kubernetes environment.
> This happens > ~80% of the times on a 3 nodes cluster.
> After investigation, this is sequence of events:
> # zk-1 is the leader and gets restarted
> # zk-2 is voted as leader
> # zk-0 tries to connect to zk-2:2888 to sync-up. The TCP connection is
> established, though it's remotely closed immediately after
> # zk-2 starts listening to port 2888 but never receives any connection
> # zk-0, after the read error on the connection goes back into LOOKING mode,
> ready for a new leader election
> # zk-2 is still waiting for follower to sync-up, at waits until the timeout
> expires (eg: 30sec) after which it goes back into looking state.
> This sequence might get repeated several times until finally one leader
> election round can get through.
> h3. Logs excerpt:
> ZK-2 becomes leader:
> {noformat}
> 2020-08-27 16:40:07.216000+00:00 [INFO ] [che.zookeeper.server.quorum.Leader]
> LEADING - LEADER ELECTION TOOK - 214 MS
> 2020-08-27 16:40:07.218000+00:00 [INFO ] [zookeeper.server.quorum.QuorumPeer]
> Peer state changed: leading - discovery
> 2020-08-27 16:40:07.218000+00:00 [INFO ] [.server.persistence.FileTxnSnapLog]
> Snapshotting: 0xf00000650 to
> /streamlio/zookeeper/data/version-2/snapshot.f00000650
> 2020-08-27 16:40:07.249000+00:00 [INFO ] [e.zookeeper.server.ZooKeeperServer]
> Snapshot taken in 30 ms
> {noformat}
>
> ZK-0 is following but immediately goes back into LOOKING state:
> {noformat}
> 2020-08-27 16:40:07.207000+00:00 [INFO ] [he.zookeeper.server.quorum.Learner]
> FOLLOWING - LEADER ELECTION TOOK - 211 MS
> 2020-08-27 16:40:07.208000+00:00 [INFO ] [zookeeper.server.quorum.QuorumPeer]
> Peer state changed: following - discovery
> 2020-08-27 16:40:07.208000+00:00 [INFO ] [he.zookeeper.server.quorum.Learner]
> Successfully connected to leader, using address:
> zookeeper-zk35-2.zookeeper-zk35.pulsar-developers.svc.cluster.local/100.101.166.47:2888
> 2020-08-27 16:40:07.214000+00:00 [WARN ] [he.zookeeper.server.quorum.Learner]
> Exception when following the leader
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:186) ~[?:?]
> at java.net.SocketInputStream.read(SocketInputStream.java:140) ~[?:?]
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:252) ~[?:?]
> at java.io.BufferedInputStream.read(BufferedInputStream.java:271) ~[?:?]
> at java.io.DataInputStream.readInt(DataInputStream.java:392) ~[?:?]
> at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
> at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
> at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134)
> at
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:182)
> at
> org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:451)
> at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:89)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1458)
> 2020-08-27 16:40:07.215000+00:00 [INFO ] [he.zookeeper.server.quorum.Learner]
> Disconnected from leader (with address:
> zookeeper-zk35-2.zookeeper-zk35.pulsar-developers.svc.cluster.local/100.101.166.47:2888).
> Was connected for 6ms. Sync state: false
> 2020-08-27 16:40:07.215000+00:00 [INFO ] [he.zookeeper.server.quorum.Learner]
> shutdown Follower
> 2020-08-27 16:40:07.215000+00:00 [INFO ] [zookeeper.server.quorum.QuorumPeer]
> Peer state changed: looking
> 2020-08-27 16:40:07.215000+00:00 [WARN ] [zookeeper.server.quorum.QuorumPeer]
> PeerState set to LOOKING
> 2020-08-27 16:40:07.215000+00:00 [INFO ] [zookeeper.server.quorum.QuorumPeer]
> LOOKING
> 2020-08-27 16:40:07.215000+00:00 [INFO ] [r.server.quorum.FastLeaderElection]
> New election. My id = 1, proposed zxid=0xf00000650
> {noformat}
> After timeout, ZK-2 goes back into looking and a new leader election takes
> place:
> {noformat}
> 2020-08-27 16:40:27.251000+00:00 [WARN ] [zookeeper.server.quorum.QuorumPeer]
> Unexpected exception
> java.lang.InterruptedException: Timeout while waiting for epoch from
> quorum
> at
> org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:1428)
> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:599)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1471)
> 2020-08-27 16:40:27.251000+00:00 [INFO ] [che.zookeeper.server.quorum.Leader]
> Shutting down
> 2020-08-27 16:40:27.251000+00:00 [INFO ] [che.zookeeper.server.quorum.Leader]
> Shutdown called. For the reason Forcing shutdown
> 2020-08-27 16:40:27.251000+00:00 [INFO ] [zookeeper.server.quorum.QuorumPeer]
> Peer state changed: looking
> 2020-08-27 16:40:27.252000+00:00 [WARN ] [zookeeper.server.quorum.QuorumPeer]
> PeerState set to LOOKING
> 2020-08-27 16:40:27.252000+00:00 [INFO ] [zookeeper.server.quorum.QuorumPeer]
> LOOKING
> {noformat}
>
> The main issue that triggers this is that there is an intrinsic race
> condition between the leader to start listening on port 2888 and the
> followers trying to connect to it. In normal deployment scenario, the
> followers will get a connection refused error, on which they will be retrying
> up to 5 times.
> Instead, with a TCP proxy in between, the follower is able to establish the
> TCP connection (to the proxy) but this connection will be terminated
> immediately as we try to read or write on the socket, since when the proxy
> connects to the leader that will be failing.
> One solution here, would be to apply the same retry logic, within the
> boundaries of the timeout also when we get a read error and not only on
> connection error.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)