[
https://issues.apache.org/jira/browse/ZOOKEEPER-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
MAJ updated ZOOKEEPER-5051:
---------------------------
Pull request submitted:
https://github.com/apache/zookeeper/pull/2391
> 4lw commands during startup can trigger NPE in ZooKeeperServer.removeCnxn and
> leave connections hanging
> -------------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-5051
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5051
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.9.5
> Environment:
> ZooKeeper 3.9.5
> Java 21.0.6
> NIOServerCnxn
> 4-letter-word commands during server startup
>
> Reporter: MAJ
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When a 4-letter command (e.g. ruok, stat) is sent to ZooKeeper during server
> startup, the connection close path can trigger a NullPointerException in
> ZooKeeperServer.removeCnxn().
> *Root cause*
>
> 4-letter commands are handled before full session establishment and may be
> processed while the server is still starting.
> During this phase:
> - ZooKeeperServer.zkDb is not yet initialized
> - The connection close path still calls ZooKeeperServer.removeCnxn()
> - removeCnxn unconditionally dereferences zkDb, leading to NPE
>
> This occurs in a race window between:
> - Accepting/processing 4lw commands
> - ZooKeeperServer initialization, specifically the startdata() call that sets zkDb.
> *Impact*
> - Client connections invoking 4lw commands during startup may hang.
> - Server logs contain intermittent NPEs.
> - Connection cleanup is incomplete, leaving the socket open on the client side.
> This can affect monitoring systems (ruok/stat/cons checks) that probe the server
> early during startup.
> *Proposed fix*
> I plan to submit a fix against master that guards
> ZooKeeperServer.removeCnxn() against a null zkDb:
> if zkDb is null, skip removal, since no DB tracking exists yet.
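
The guard described above could look roughly like the following. This is a minimal, self-contained sketch and not the code from the pull request: Server, Database and Cnxn are hypothetical stand-ins for ZooKeeperServer, ZKDatabase and ServerCnxn, and the boolean return value and the volatile field are assumptions made only so the behavior is easy to observe.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class RemoveCnxnGuardSketch {

    /** Hypothetical stand-in for ServerCnxn. */
    static final class Cnxn {
        final String name;
        Cnxn(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for ZKDatabase: only tracks registered connections. */
    static final class Database {
        private final Set<Cnxn> tracked = ConcurrentHashMap.newKeySet();
        void addCnxn(Cnxn c) { tracked.add(c); }
        void removeCnxn(Cnxn c) { tracked.remove(c); }
        int size() { return tracked.size(); }
    }

    /** Hypothetical stand-in for ZooKeeperServer; zkDb stays null until startdata(). */
    static final class Server {
        // volatile is an assumption of this sketch, so that a worker thread
        // closing a connection sees the value assigned by the startup thread.
        private volatile Database zkDb;

        void startdata() { zkDb = new Database(); }

        /** Returns true if removal was delegated, false if it was skipped. */
        boolean removeCnxn(Cnxn c) {
            Database db = zkDb;   // read the field once, then work on the local copy
            if (db == null) {
                return false;     // startup not finished: no DB tracking exists yet
            }
            db.removeCnxn(c);
            return true;
        }
    }

    public static void main(String[] args) {
        Server server = new Server();
        Cnxn probe = new Cnxn("4lw-probe");
        System.out.println("before startdata(): delegated=" + server.removeCnxn(probe));
        server.startdata();
        System.out.println("after startdata():  delegated=" + server.removeCnxn(probe));
    }
}
```

Reading zkDb into a local variable before the null check keeps the check and the call on the same reference, which matters if another thread assigns the field in between.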
> *Reproduction/Emulation*
>
> Start ZooKeeper and simultaneously send repeated 4lw commands:
>
> {code:bash}
> while true; do printf ruok | nc localhost 2181; done
> {code}
>
> During startup, intermittent NPEs are observed.
> *Testing*
> The issue is timing-dependent and difficult to reproduce deterministically in
> a unit test.
> A simple and deterministic regression test is proposed that verifies
> removeCnxn() is safe when zkDb == null.
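
A deterministic check of this kind can be sketched without any timing dependence by leaving zkDb unset and calling removeCnxn() directly. The example below is an illustration under stated assumptions, not the test from the pull request: it uses plain Java instead of the project's test framework, and UnguardedServer, GuardedServer and Db are hypothetical stand-ins rather than real ZooKeeper classes.

```java
class RemoveCnxnNullDbTestSketch {

    /** Hypothetical stand-in for ZKDatabase that counts removals. */
    static final class Db {
        int removed;
        void removeCnxn(Object cnxn) { removed++; }
    }

    /** Models the reported behavior: zkDb is dereferenced unconditionally. */
    static class UnguardedServer {
        Db zkDb; // null until initialization, as during server startup
        void removeCnxn(Object cnxn) { zkDb.removeCnxn(cnxn); }
    }

    /** Models the proposed behavior: removal is skipped while zkDb is null. */
    static class GuardedServer extends UnguardedServer {
        @Override
        void removeCnxn(Object cnxn) {
            Db db = zkDb;
            if (db != null) {
                db.removeCnxn(cnxn);
            }
        }
    }

    /** Runs the action and reports whether it threw a NullPointerException. */
    static boolean throwsNpe(Runnable action) {
        try {
            action.run();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        UnguardedServer unguarded = new UnguardedServer();
        GuardedServer guarded = new GuardedServer();
        Object cnxn = new Object();

        // With zkDb == null, only the unguarded variant should fail.
        check(throwsNpe(() -> unguarded.removeCnxn(cnxn)), "unguarded call should throw NPE");
        check(!throwsNpe(() -> guarded.removeCnxn(cnxn)), "guarded call should not throw");

        // Once the database exists, the guarded variant must still delegate.
        guarded.zkDb = new Db();
        guarded.removeCnxn(cnxn);
        check(guarded.zkDb.removed == 1, "guarded call should delegate after init");

        System.out.println("all checks passed");
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }
}
```

Keeping the unguarded variant in the sketch shows that the null-zkDb setup really reproduces the reported failure, so a pass on the guarded variant is meaningful.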
> *Guilty stack*
> {code:java}
> 2026-05-13 10:11:06,025 [NIOWorkerThread-1] org.apache.zookeeper.server.ServerCnxn : ERROR - [myid:] Error closing a command socket
> java.lang.NullPointerException: Cannot invoke "org.apache.zookeeper.server.ZKDatabase.removeCnxn(org.apache.zookeeper.server.ServerCnxn)" because "this.zkDb" is null
>     at org.apache.zookeeper.server.ZooKeeperServer.removeCnxn(ZooKeeperServer.java:333)
>     at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:614)
>     at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:604)
>     at org.apache.zookeeper.server.ServerCnxn.cleanupWriterSocket(ServerCnxn.java:593)
>     at org.apache.zookeeper.server.command.AbstractFourLetterCommand.run(AbstractFourLetterCommand.java:60)
>     at org.apache.zookeeper.server.command.AbstractFourLetterCommand.start(AbstractFourLetterCommand.java:51)
>     at org.apache.zookeeper.server.command.CommandExecutor.execute(CommandExecutor.java:45)
>     at org.apache.zookeeper.server.NIOServerCnxn.checkFourLetterWord(NIOServerCnxn.java:548)
>     at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:562)
>     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:352)
>     at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
>     at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>     at java.base/java.lang.Thread.run(Thread.java:1583)
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)