MAJ created ZOOKEEPER-5051:
------------------------------
Summary: 4lw commands during startup can trigger NPE in
ZooKeeperServer.removeCnxn and leave connections hanging
Key: ZOOKEEPER-5051
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5051
Project: ZooKeeper
Issue Type: Bug
Components: server
Affects Versions: 3.9.5
Environment:
ZooKeeper 3.9.5
Java 21.0.6
NIOServerCnxn
4-letter-word commands during server startup
Reporter: MAJ
When a 4-letter command (e.g. ruok, stat) is sent to ZooKeeper during server
startup, the connection close path can trigger a NullPointerException in
ZooKeeperServer.removeCnxn().
*Root cause*
4-letter commands are handled before full session establishment and may be
processed while the server is still starting.
During this phase:
- ZooKeeperServer.zkDb is not yet initialized
- The connection close path still calls ZooKeeperServer.removeCnxn()
- removeCnxn unconditionally dereferences zkDb, leading to NPE
This occurs in a race window between:
- Accepting/processing 4lw commands
- ZooKeeperServer initialization, specifically the startdata() call that sets zkDb.
*Impact*
- Client connections invoking 4lw commands during startup may hang.
- Server logs contain intermittent NPEs.
- Connection cleanup is incomplete, leaving the socket open on the client side.
This can affect monitoring systems (ruok/stat/cons checks) that probe the server
early during startup.
*Proposed fix*
I plan to submit a fix against master that guards
ZooKeeperServer.removeCnxn() against a null zkDb:
if zkDb is null, skip the removal, since no DB tracking exists yet.
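
The guard could look roughly like the sketch below. It uses simplified stand-in classes rather than the real ZooKeeper sources: CnxnStub, ZkDbStub and GuardedServerSketch are hypothetical names, and the volatile field and the single read into a local variable are assumptions of the sketch, not a description of the actual patch.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for ServerCnxn. */
final class CnxnStub {}

/** Hypothetical stand-in for ZKDatabase; it only tracks connections. */
final class ZkDbStub {
    private final Set<CnxnStub> cnxns = ConcurrentHashMap.newKeySet();

    void addCnxn(CnxnStub cnxn) { cnxns.add(cnxn); }
    void removeCnxn(CnxnStub cnxn) { cnxns.remove(cnxn); }
    boolean contains(CnxnStub cnxn) { return cnxns.contains(cnxn); }
}

/** Hypothetical stand-in for ZooKeeperServer, reduced to the guarded method. */
final class GuardedServerSketch {
    // Stays null until startdata() runs. Declared volatile here (an assumption
    // of this sketch) so a worker thread observes the write from startdata().
    private volatile ZkDbStub zkDb;

    void startdata() { zkDb = new ZkDbStub(); }

    ZkDbStub getZkDb() { return zkDb; }

    void removeCnxn(CnxnStub cnxn) {
        // Read the field once so the null check and the call use the same reference.
        ZkDbStub db = zkDb;
        if (db == null) {
            // Server still starting: nothing has been tracked, so nothing to remove.
            return;
        }
        db.removeCnxn(cnxn);
    }
}
```

With this shape, a close that arrives before startdata() becomes a no-op, while a close after initialization still removes the connection from the database.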
*Reproduction/Emulation*
Start ZooKeeper and simultaneously send repeated 4lw commands:
while true; do printf ruok | nc -w 1 localhost 2181; done
During startup, intermittent NPEs are observed.
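
Where nc is not available, a small Java client can stand in for the shell loop. This is only a sketch: FourLetterProbe and its methods are hypothetical names, and the timeout and the pause between attempts are assumptions, not part of the original reproduction.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that mimics `printf ruok | nc -w 1 host port`. */
final class FourLetterProbe {

    /** Sends one 4lw command; returns the reply, or null on error or timeout. */
    static String send(String host, int port, String cmd, int timeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(new InetSocketAddress(host, port), timeoutMs);
            // A hung connection surfaces as a read timeout instead of blocking forever.
            sock.setSoTimeout(timeoutMs);
            OutputStream out = sock.getOutputStream();
            out.write(cmd.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            // The server closes the socket after answering, so read until EOF.
            byte[] reply = sock.getInputStream().readAllBytes();
            return new String(reply, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            return null;
        }
    }

    /** Probes repeatedly and returns how many attempts got no reply. */
    static int countFailures(String host, int port, String cmd, int attempts)
            throws InterruptedException {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            if (send(host, port, cmd, 1000) == null) {
                failures++;
            }
            Thread.sleep(20); // short pause between attempts (assumption)
        }
        return failures;
    }
}
```

Calling FourLetterProbe.countFailures("localhost", 2181, "ruok", 500) while the server starts approximates the shell loop. Attempts made before the port is open also count as failures, so the server log remains the place to confirm the NPE.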
*Testing*
The issue is timing-dependent and difficult to reproduce deterministically in a
unit test.
A simple and deterministic regression test is proposed that verifies
removeCnxn() is safe when zkDb == null.
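
The shape of such a test might be as follows. This is a framework-free sketch against a hypothetical stand-in (ServerUnderTest), not the real ZooKeeperServer; the actual regression test would presumably use the project's JUnit setup and the real classes.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the server under test; not the real ZooKeeperServer. */
final class ServerUnderTest {
    // Stand-in for ZKDatabase connection tracking; null until startdata().
    private Set<Object> zkDb;

    void startdata() { zkDb = new HashSet<>(); }

    void addCnxn(Object cnxn) { zkDb.add(cnxn); }

    boolean tracks(Object cnxn) { return zkDb != null && zkDb.contains(cnxn); }

    void removeCnxn(Object cnxn) {
        if (zkDb == null) {
            return; // the guard under test
        }
        zkDb.remove(cnxn);
    }
}

final class RemoveCnxnNullDbCheck {
    /** Returns true when both expectations hold. */
    static boolean run() {
        ServerUnderTest server = new ServerUnderTest();
        Object cnxn = new Object();
        try {
            // zkDb == null: the call must be a no-op rather than throw.
            server.removeCnxn(cnxn);
        } catch (NullPointerException e) {
            return false;
        }
        // After initialization the guard must not swallow a real removal.
        server.startdata();
        server.addCnxn(cnxn);
        server.removeCnxn(cnxn);
        return !server.tracks(cnxn);
    }
}
```

Checking the post-initialization case as well keeps the guard from masking a regression in which removeCnxn() silently stops removing connections.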
*Guilty stack*
{code:java}
2026-05-13 10:11:06,025 [NIOWorkerThread-1] org.apache.zookeeper.server.ServerCnxn : ERROR - [myid:] Error closing a command socket
java.lang.NullPointerException: Cannot invoke "org.apache.zookeeper.server.ZKDatabase.removeCnxn(org.apache.zookeeper.server.ServerCnxn)" because "this.zkDb" is null
    at org.apache.zookeeper.server.ZooKeeperServer.removeCnxn(ZooKeeperServer.java:333)
    at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:614)
    at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:604)
    at org.apache.zookeeper.server.ServerCnxn.cleanupWriterSocket(ServerCnxn.java:593)
    at org.apache.zookeeper.server.command.AbstractFourLetterCommand.run(AbstractFourLetterCommand.java:60)
    at org.apache.zookeeper.server.command.AbstractFourLetterCommand.start(AbstractFourLetterCommand.java:51)
    at org.apache.zookeeper.server.command.CommandExecutor.execute(CommandExecutor.java:45)
    at org.apache.zookeeper.server.NIOServerCnxn.checkFourLetterWord(NIOServerCnxn.java:548)
    at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:562)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:352)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1583)
{code}