[ https://issues.apache.org/jira/browse/ZOOKEEPER-4394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17872050#comment-17872050 ]
Benoit Sigoure edited comment on ZOOKEEPER-4394 at 8/8/24 3:07 PM: ------------------------------------------------------------------- One of our dev clusters ran into this bug. Quorum: zookeeper-0, zookeeper-1(leader), zookeeper-2 Client app holding an ephemeral znode exits: {code:java} # zookeeper-0 log 2024-08-07 15:22:58,929 [myid:] - INFO [NIOWorkerThread-4:o.a.z.s.NIOServerCnxn@337] - Unable to read additional data from client, it probably closed the socket: address = /10.128.31.11:42544, session = 0x30019eb942d0028 {code} shortly thereafter: {code:java} # zookeeper-2 log 2024-08-07 15:23:00,009 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@922] - Peer state changed: following - synchronization 2024-08-07 15:23:00,010 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@490] - Configuring CommitProcessor with readBatchSize -1 commitBatchSize 1 2024-08-07 15:23:00,010 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@451] - Configuring CommitProcessor with 8 worker threads. 2024-08-07 15:23:00,010 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FollowerRequestProcessor@59] - Initialized FollowerRequestProcessor with zookeeper.follower.skipLearnerRequestToNextProcessor as false 2024-08-07 15:23:00,011 [myid:] - WARN [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.j.MBeanRegistry@110] - Failed to register MBean InMemoryDataTree 2024-08-07 15:23:00,011 [myid:] - WARN [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.LearnerZooKeeperServer@104] - Failed to register with JMX javax.management.InstanceAlreadyExistsException: org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=Follower,name3=InMemoryDataTree at java.management/com.sun.jmx.mbeanserver.Repository.addMBean(Unknown Source) at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(Unknown Source) at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(Unknown Source) at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(Unknown Source) at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(Unknown Source) at java.management/com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(Unknown Source) at org.apache.zookeeper.jmx.MBeanRegistry.register(MBeanRegistry.java:106) at org.apache.zookeeper.server.quorum.LearnerZooKeeperServer.registerJMX(LearnerZooKeeperServer.java:102) at org.apache.zookeeper.server.ZooKeeperServer.startupWithServerState(ZooKeeperServer.java:730) at org.apache.zookeeper.server.ZooKeeperServer.startupWithoutServing(ZooKeeperServer.java:713) at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:760) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:108) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1539) 2024-08-07 15:23:00,012 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Learner@721] - Learner received UPTODATE message 2024-08-07 15:23:00,013 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@917] - Peer state changed: following - broadcast 2024-08-07 15:23:00,015 [myid:] - INFO [CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global session 0x10000242c8c49bc 2024-08-07 15:23:00,015 [myid:] - INFO [CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global session 0x10000242c8c49bd 2024-08-07 15:23:00,016 [myid:] - INFO [CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global session 0x10000242c8c49be 2024-08-07 15:23:00,103 [myid:] - WARN [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@169] - Got zxid 0x632029b34c expected 0x1 2024-08-07 15:23:00,604 [myid:] - INFO [CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global session 0x10000242c8c49bf 2024-08-07 15:23:02,202 [myid:] - INFO [CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global session 0x10000242c8c49c0 2024-08-07 15:23:05,426 [myid:] - INFO [CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global session 0x200001b71c42ddb 2024-08-07 15:23:05,671 [myid:] - INFO [NIOWorkerThread-1:o.a.z.s.NIOServerCnxn@522] - Processing ruok command from /127.0.0.1:48192 2024-08-07 15:23:05,671 [myid:] - INFO [NIOWorkerThread-8:o.a.z.s.NIOServerCnxn@522] - Processing ruok command from /127.0.0.1:48184 2024-08-07 15:23:07,087 [myid:] - WARN [SyncThread:3:o.a.z.s.p.FileTxnLog@394] - fsync-ing the write ahead log in SyncThread:3 took 7075ms which will adversely effect operation latency.File size is 67108880 bytes. See the ZooKeeper troubleshooting guide 2024-08-07 15:23:10,107 [myid:] - WARN [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@128] - Exception when following the leader java.io.EOFException: null at java.base/java.io.DataInputStream.readInt(Unknown Source) at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96) at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86) at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134) at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:228) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:124) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1539) 2024-08-07 15:23:10,108 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@142] - Disconnected from leader (with address: zookeeper-1-zookeeper.zookeeper.svc.cluster.local./172.29.203.237:2888). Was connected for 17136ms. Sync state: true 2024-08-07 15:23:10,108 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@286] - shutdown Follower 2024-08-07 15:23:10,108 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.ZooKeeperServer@853] - shutting down 2024-08-07 15:23:10,108 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.RequestThrottler@256] - Shutting down 2024-08-07 15:23:10,108 [myid:] - INFO [RequestThrottler:o.a.z.s.RequestThrottler@217] - Draining request throttler queue 2024-08-07 15:23:10,108 [myid:] - INFO [RequestThrottler:o.a.z.s.RequestThrottler@195] - RequestThrottler shutdown. Dropped 0 requests 2024-08-07 15:23:10,108 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FollowerRequestProcessor@172] - Shutting down 2024-08-07 15:23:10,108 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@631] - Shutting down 2024-08-07 15:23:10,108 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.FinalRequestProcessor@684] - shutdown of request processor complete 2024-08-07 15:23:10,108 [myid:] - INFO [FollowerRequestProcessor:3:o.a.z.s.q.FollowerRequestProcessor@128] - FollowerRequestProcessor exited loop! 2024-08-07 15:23:10,108 [myid:] - INFO [CommitProcessor:3:o.a.z.s.q.CommitProcessor@419] - CommitProcessor exited loop! 2024-08-07 15:23:10,191 [myid:] - ERROR [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.p.Util@166] - Last transaction was partial. {code} Resulting state: ephemeral znode held by 10.128.31.11:42544 was correctly removed on zookeeper-0 and zookeeper-1 but lingers on zookeeper-2. Env: {code} 2024-08-07 16:18:46,793 [myid:3] - INFO [main:o.a.z.Environment@98] - Server environment:zookeeper.version=3.8.2-139d619b58292d7734b4fc83a0f44be4e7b0c986, built on 2023-07-05 19:24 UTC 2024-08-07 16:18:46,793 [myid:3] - INFO [main:o.a.z.Environment@98] - Server environment:host.name=zookeeper-2.zookeeper.zookeeper.svc.cluster.local 2024-08-07 16:18:46,794 [myid:3] - INFO [main:o.a.z.Environment@98] - Server environment:java.version=11.0.20.1 2024-08-07 16:18:46,794 [myid:3] - INFO [main:o.a.z.Environment@98] - Server environment:java.vendor=Eclipse Adoptium 2024-08-07 16:18:46,794 [myid:3] - INFO [main:o.a.z.Environment@98] - Server environment:java.home=/opt/java/openjdk {code} was (Author: tsuna): One of our dev clusters ran into this bug. Quorum: zookeeper-0, zookeeper-1(leader), zookeeper-2 Client app holding an ephemeral znode exits: {code:java} # zookeeper-0 log 2024-08-07 15:22:58,929 [myid:] - INFO [NIOWorkerThread-4:o.a.z.s.NIOServerCnxn@337] - Unable to read additional data from client, it probably closed the socket: address = /10.128.31.11:42544, session = 0x30019eb942d0028 {code} shortly thereafter: {code:java} # zookeeper-2 log 2024-08-07 15:23:00,009 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@922] - Peer state changed: following - synchronization 2024-08-07 15:23:00,010 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@490] - Configuring CommitProcessor with readBatchSize -1 commitBatchSize 1 2024-08-07 15:23:00,010 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@451] - Configuring CommitProcessor with 8 worker threads. 2024-08-07 15:23:00,010 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FollowerRequestProcessor@59] - Initialized FollowerRequestProcessor with zookeeper.follower.skipLearnerRequestToNextProcessor as false 2024-08-07 15:23:00,011 [myid:] - WARN [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.j.MBeanRegistry@110] - Failed to register MBean InMemoryDataTree 2024-08-07 15:23:00,011 [myid:] - WARN [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.LearnerZooKeeperServer@104] - Failed to register with JMX javax.management.InstanceAlreadyExistsException: org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=Follower,name3=InMemoryDataTree at java.management/com.sun.jmx.mbeanserver.Repository.addMBean(Unknown Source) at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(Unknown Source) at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(Unknown Source) at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(Unknown Source) at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(Unknown Source) at java.management/com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(Unknown Source) at org.apache.zookeeper.jmx.MBeanRegistry.register(MBeanRegistry.java:106) at org.apache.zookeeper.server.quorum.LearnerZooKeeperServer.registerJMX(LearnerZooKeeperServer.java:102) at org.apache.zookeeper.server.ZooKeeperServer.startupWithServerState(ZooKeeperServer.java:730) at org.apache.zookeeper.server.ZooKeeperServer.startupWithoutServing(ZooKeeperServer.java:713) at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:760) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:108) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1539) 2024-08-07 15:23:00,012 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Learner@721] - Learner received UPTODATE message 2024-08-07 15:23:00,013 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@917] - Peer state changed: following - broadcast 2024-08-07 15:23:00,015 [myid:] - INFO [CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global session 0x10000242c8c49bc 2024-08-07 15:23:00,015 [myid:] - INFO [CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global session 0x10000242c8c49bd 2024-08-07 15:23:00,016 [myid:] - INFO [CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global session 0x10000242c8c49be 2024-08-07 15:23:00,103 [myid:] - WARN [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@169] - Got zxid 0x632029b34c expected 0x1 2024-08-07 15:23:00,604 [myid:] - INFO [CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global session 0x10000242c8c49bf 2024-08-07 15:23:02,202 [myid:] - INFO [CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global session 0x10000242c8c49c0 2024-08-07 15:23:05,426 [myid:] - INFO [CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global session 0x200001b71c42ddb 2024-08-07 15:23:05,671 [myid:] - INFO [NIOWorkerThread-1:o.a.z.s.NIOServerCnxn@522] - Processing ruok command from /127.0.0.1:48192 2024-08-07 15:23:05,671 [myid:] - INFO [NIOWorkerThread-8:o.a.z.s.NIOServerCnxn@522] - Processing ruok command from /127.0.0.1:48184 2024-08-07 15:23:07,087 [myid:] - WARN [SyncThread:3:o.a.z.s.p.FileTxnLog@394] - fsync-ing the write ahead log in SyncThread:3 took 7075ms which will adversely effect operation latency.File size is 67108880 bytes. See the ZooKeeper troubleshooting guide 2024-08-07 15:23:10,107 [myid:] - WARN [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@128] - Exception when following the leader java.io.EOFException: null at java.base/java.io.DataInputStream.readInt(Unknown Source) at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96) at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86) at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134) at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:228) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:124) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1539) 2024-08-07 15:23:10,108 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@142] - Disconnected from leader (with address: zookeeper-1-zookeeper.zookeeper.svc.cluster.local./172.29.203.237:2888). Was connected for 17136ms. Sync state: true 2024-08-07 15:23:10,108 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@286] - shutdown Follower 2024-08-07 15:23:10,108 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.ZooKeeperServer@853] - shutting down 2024-08-07 15:23:10,108 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.RequestThrottler@256] - Shutting down 2024-08-07 15:23:10,108 [myid:] - INFO [RequestThrottler:o.a.z.s.RequestThrottler@217] - Draining request throttler queue 2024-08-07 15:23:10,108 [myid:] - INFO [RequestThrottler:o.a.z.s.RequestThrottler@195] - RequestThrottler shutdown. Dropped 0 requests 2024-08-07 15:23:10,108 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FollowerRequestProcessor@172] - Shutting down 2024-08-07 15:23:10,108 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@631] - Shutting down 2024-08-07 15:23:10,108 [myid:] - INFO [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.FinalRequestProcessor@684] - shutdown of request processor complete 2024-08-07 15:23:10,108 [myid:] - INFO [FollowerRequestProcessor:3:o.a.z.s.q.FollowerRequestProcessor@128] - FollowerRequestProcessor exited loop! 2024-08-07 15:23:10,108 [myid:] - INFO [CommitProcessor:3:o.a.z.s.q.CommitProcessor@419] - CommitProcessor exited loop! 2024-08-07 15:23:10,191 [myid:] - ERROR [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.p.Util@166] - Last transaction was partial. {code} Resulting state: ephemeral znode held by 10.128.31.11:42544 was correctly removed on zookeeper-0 and zookeeper-1 but lingers on zookeeper-2. > Learner.syncWithLeader got NullPointerException > ----------------------------------------------- > > Key: ZOOKEEPER-4394 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4394 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.7.0 > Environment: ZooKeeper 3.7.0 > Reporter: Liu Haifeng > Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > ZooKeeper follower node encountered NullPointerException during > syncWithLeader. > Logs indicate that the follower has received NEWLEADER packet between a > PROPOSAL packet and it's corresponding COMMIT packet. The NEWLEADER packet > leads to packetsNotCommitted.clear(), yet the COMMIT packet still wants to do > packetsNotCommitted.peekFirst() to get the former PROPOSAL packet, and the > later if-statement raised NPE. > {code:java} > case Leader.COMMIT: > case Leader.COMMITANDACTIVATE: > pif = packetsNotCommitted.peekFirst(); > if (pif.hdr.getZxid() == qp.getZxid() && qp.getType() == > Leader.COMMITANDACTIVATE) { > // ... > }{code} > After look into the Leader side, I found: > # LearnerHandler.syncFollower queues packets with zxid <= maxCommittedLog > (PROPOSAL/COMMIT pairs); > # Leader.startForwarding queues toBeApplied packets(PROPOSAL/COMMIT pairs); > # Leader.startForwarding queues outstandingProposals packets(PROSOAL only); > # LeanerHandler.run sends NEWLEADER message. > Seams if the outstandingProposals is not empty at the certain moment, the > follower could then receive PROPOSAL/NEWLEADER/COMMIT packets in order. > The follower will retry from LOOKING again and is expected to be succeed at > last, however, under heavy load it may be too many retries. Further more, I > my case the follower has to sync data from leader's disk, and start over > again after the NPE(prior sync not flushed?), which may harm the leader. > I don't know if it is designed so or not, but consider the performance, can > we at least avoid wasting of network/disk IO? -- This message was sent by Atlassian Jira (v8.20.10#820010)