zhanglu153 created ZOOKEEPER-4781: ------------------------------------- Summary: ZooKeeper not starting because the accepted epoch is less than the current epoch. Key: ZOOKEEPER-4781 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4781 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.9.1, 3.8.3, 3.7.2, 3.6.4, 3.4.14, 3.5.10 Reporter: zhanglu153
This issue occurred in our abnormal testing environment, where the disk was injected with anomalies and frequently filled up. The the scenario is as follows: # Configure three node ZooKeeper cluster, lets say nodes are A, B and C. # Start the cluster, and node C becomes the leader. # The disk of Node C is injected with a full disk exception. # Node C called the org.apache.zookeeper.server.quorum.Leader#lead method. {code:java} void lead() throws IOException, InterruptedException { self.end_fle = Time.currentElapsedTime(); long electionTimeTaken = self.end_fle - self.start_fle; self.setElectionTimeTaken(electionTimeTaken); LOG.info("LEADING - LEADER ELECTION TOOK - {}", electionTimeTaken); self.start_fle = 0; self.end_fle = 0; zk.registerJMX(new LeaderBean(this, zk), self.jmxLocalPeerBean); try { self.tick.set(0); zk.loadData(); leaderStateSummary = new StateSummary(self.getCurrentEpoch(), zk.getLastProcessedZxid()); // Start thread that waits for connection requests from // new followers. cnxAcceptor = new LearnerCnxAcceptor(); cnxAcceptor.start(); readyToStart = true; long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch()); zk.setZxid(ZxidUtils.makeZxid(epoch, 0)); synchronized(this){ lastProposed = zk.getZxid(); } newLeaderProposal.packet = new QuorumPacket(NEWLEADER, zk.getZxid(), null, null); if ((newLeaderProposal.packet.getZxid() & 0xffffffffL) != 0) { LOG.info("NEWLEADER proposal has Zxid of " + Long.toHexString(newLeaderProposal.packet.getZxid())); } waitForEpochAck(self.getId(), leaderStateSummary); self.setCurrentEpoch(epoch); ... {code} # Node C, as the leader, will start the LearnerCnxAcceptor thread, and then call the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method. At this time, the value of waitingForNewEpoch is true, and the size of connectingFollowers is not greater than n/2. Node C directly calls connectingFollowers.wait to wait. The maximum waiting time is self.getInitLimit()*self.getTickTime() ms. {code:java} public long getEpochToPropose(long sid, long lastAcceptedEpoch) throws InterruptedException, IOException { synchronized(connectingFollowers) { if (!waitingForNewEpoch) { return epoch; } if (lastAcceptedEpoch >= epoch) { epoch = lastAcceptedEpoch+1; } if (isParticipant(sid)) { connectingFollowers.add(sid); } QuorumVerifier verifier = self.getQuorumVerifier(); if (connectingFollowers.contains(self.getId()) && verifier.containsQuorum(connectingFollowers)) { self.setAcceptedEpoch(epoch); waitingForNewEpoch = false; connectingFollowers.notifyAll(); } else { long start = Time.currentElapsedTime(); long cur = start; long end = start + self.getInitLimit()*self.getTickTime(); while(waitingForNewEpoch && cur < end) { connectingFollowers.wait(end - cur); cur = Time.currentElapsedTime(); } if (waitingForNewEpoch) { throw new InterruptedException("Timeout while waiting for epoch from quorum"); } } return epoch; } } {code} # Node B connects to the 2888 communication port of node C and starts a new LeanerHandler thread. # Node A connects to the 2888 communication port of node C and starts a new LeanerHandler thread. # After node B connects to node C, call the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method in the LearnerHandler thread.At this point, the value of waitingForNewEpoch is true, and the size of connectingFollowers is greater than n/2. Then, set the value of waitingForNewEpoch to false. Due to the disk of node C being full, calling setAcceptedEpoch to write the acceptedEpoch value failed with an IO exception. Node C fails to update the acceptedEpoch file and did not successfully call the connectingFollowers.notifyAll() method. This will cause node C to wait at connectingFollowers.wait, with a maximum wait of self.getInitLimit()*self.getTickTime() ms. Due to an IO exception thrown, the socket connection between node B and node C was disconnected, and the LeanerHandler thread exited. {code:java} 2023-12-15 15:08:35,572 [myid:3] - ERROR [LearnerHandler-/192.168.174.227:41696:LearnerHandler@648] - Unexpected exception causing shutdown while sock still open2023-12-15 15:08:35,572 [myid:3] - ERROR [LearnerHandler-/192.168.174.227:41696:LearnerHandler@648] - Unexpected exception causing shutdown while sock still openjava.io.FileNotFoundException: /cloud/data/zookeeper/data/version-2/acceptedEpoch.tmp (No space left on device) at java.io.FileOutputStream.open0(Native Method) at java.io.FileOutputStream.open(FileOutputStream.java:270) at java.io.FileOutputStream.<init>(FileOutputStream.java:213) at java.io.FileOutputStream.<init>(FileOutputStream.java:162) at org.apache.zookeeper.common.AtomicFileOutputStream.<init>(AtomicFileOutputStream.java:59) at org.apache.zookeeper.server.quorum.QuorumPeer.writeLongToFile(QuorumPeer.java:1393) at org.apache.zookeeper.server.quorum.QuorumPeer.setAcceptedEpoch(QuorumPeer.java:1437) at org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:885) at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:358) {code} # After node A connects to node C, the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method is called in the LearnerHandler thread. At this time, the value of waitingForNewEpoch is false, and epoch is returned directly. # Node B reconnects to the 2888 cluster communication port of node C and starts a new LeanerHandler thread. Call the org.apache.zookeeper.server.quorum.Leader#getEpochToPropose method in the LearnerHandler thread. At this time, the value of waitingForNewEpoch is false, and epoch is returned directly. # After node A connects to node C, the LeanerHandler thread will then call the org.apache.zookeeper.server.quorum.Leader#waitForEpochAck method. At this point, selectingFollowers does not include node C, Call the selectingFollowers.wait method to wait. # After node B connects to node C, the LeanerHandler thread will then call the org.apache.zookeeper.server.quorum.Leader#waitForEpochAck method. At this point, selectingFollowers does not include node C, Call the selectingFollowers.wait method to wait. {code:java} public void waitForEpochAck(long id, StateSummary ss) throws IOException, InterruptedException { synchronized(electingFollowers) { if (electionFinished) { return; } if (ss.getCurrentEpoch() != -1) { if (ss.isMoreRecentThan(leaderStateSummary)) { throw new IOException("Follower is ahead of the leader, leader summary: " + leaderStateSummary.getCurrentEpoch() + " (current epoch), " + leaderStateSummary.getLastZxid() + " (last zxid)"); } if (isParticipant(id)) { electingFollowers.add(id); } } QuorumVerifier verifier = self.getQuorumVerifier(); if (electingFollowers.contains(self.getId()) && verifier.containsQuorum(electingFollowers)) { electionFinished = true; electingFollowers.notifyAll(); } else { long start = Time.currentElapsedTime(); long cur = start; long end = start + self.getInitLimit()*self.getTickTime(); while(!electionFinished && cur < end) { electingFollowers.wait(end - cur); cur = Time.currentElapsedTime(); } if (!electionFinished) { throw new InterruptedException("Timeout while waiting for epoch to be acked by quorum"); } } } } {code} # Node C waits for self.getInitLimit()*self.getTickTime() ms in the connectingFollowers.wait method. However, since the value of waitingForNewEpoch is false, it does not throw a timeout exception and returns the epoch directly. # Node C calls the org.apache.zookeeper.server.quorum.Leader#waitForEpochAck method. At this time, selectingFollowers contains node C and the size of selectingFollowers is greater than n/2. Then call the selectingFollowers.notifyAll() method to release the lock. # The disk space of node C has been partially released, so subsequent calls to setCurrentEpoch have successfully updated the currentEpoch file. # The cluster provides services normally for a period of time. # Stop C. # Start C, bellow exception with message "The accepted epoch, c is less than the current epoch, d" is thrown. {code:java} 2023-12-15 15:57:00,051 [myid:3] - ERROR [main:QuorumPeer@698] - Unable to load database on disk java.io.IOException: The accepted epoch, c is less than the current epoch, d at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:695) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81) 2023-12-15 15:57:00,132 [myid:3] - ERROR [main:QuorumPeerMain@92] - Unexpected exception, exiting abnormally java.lang.RuntimeException: Unable to run quorum server at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:699) at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:636) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:170) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:114) at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81) Caused by: java.io.IOException: The accepted epoch, c is less than the current epoch, d at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:695) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)