MAJ created ZOOKEEPER-5051:
------------------------------

             Summary: 4lw commands during startup can trigger NPE in 
ZooKeeperServer.removeCnxn and leave connections hanging
                 Key: ZOOKEEPER-5051
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5051
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.9.5
         Environment:  


ZooKeeper 3.9.5
Java 21.0.6
NIOServerCnxn
4-letter-word commands during server startup
 
            Reporter: MAJ


When a 4-letter command (e.g. ruok, stat) is sent to ZooKeeper during server 
startup, the connection close path can trigger a NullPointerException in 
ZooKeeperServer.removeCnxn().

*Root cause*

 
4-letter commands are handled before full session establishment and may be 
processed while the server is still starting.
During this phase:
- ZooKeeperServer.zkDb is not yet initialized
- The connection close path still calls ZooKeeperServer.removeCnxn()
- removeCnxn unconditionally dereferences zkDb, leading to NPE
 
This occurs in a race window between:
- Accepting/processing 4lw commands
- ZooKeeperServer initialization, specifically the startdata() call that 
initializes zkDb.

*Impact*


- Client connections invoking 4lw commands during startup may hang.
- Server logs contain intermittent NPEs.
- Connection cleanup is incomplete, leaving the socket open on the client side.

This can affect monitoring systems (ruok/stat/cons checks) that probe the 
server early during startup.

*Proposed fix*

 
I plan to submit a fix against master that guards 
ZooKeeperServer.removeCnxn() against a null zkDb: 
if zkDb is null, skip the removal, since no DB tracking exists yet.
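
The guard described above can be sketched on a reduced model. This is not the actual ZooKeeper source: FakeDatabase and GuardedServer are hypothetical stand-ins for ZKDatabase and ZooKeeperServer, and the volatile field and boolean return value are assumptions made only so the behavior is easy to observe.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for ZKDatabase; it only tracks connections.
class FakeDatabase {
    final Set<Object> cnxns = new HashSet<>();

    void removeCnxn(Object cnxn) {
        cnxns.remove(cnxn);
    }
}

// Hypothetical stand-in for ZooKeeperServer, reduced to the zkDb field.
class GuardedServer {
    // Stays null until startdata() runs, mirroring the startup window.
    // volatile is an assumption of this sketch, not a claim about the real field.
    private volatile FakeDatabase zkDb;

    void startdata() {
        zkDb = new FakeDatabase();
    }

    // Returns true if the removal was delegated to the database,
    // false if it was skipped because the database does not exist yet.
    boolean removeCnxn(Object cnxn) {
        FakeDatabase db = zkDb; // read once so the check and the call see the same value
        if (db == null) {
            return false; // nothing is tracked yet, so there is nothing to remove
        }
        db.removeCnxn(cnxn);
        return true;
    }
}
```

On this model, calling removeCnxn() before startdata() returns false instead of throwing, and calling it afterwards delegates to the database. That is the same property the proposed regression test checks for zkDb == null.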

*Reproduction/Emulation*
 
Start ZooKeeper and simultaneously send repeated 4lw commands:
 
while true; do printf ruok | nc -w 1 localhost 2181; done
 
During startup, intermittent NPEs are observed.
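
Where nc is not available, a rough Java equivalent of the probe loop above might look like the following sketch. FourLetterProbe and its parameters are hypothetical names, and the 1000 ms timeout only mirrors the -w 1 option. A read timeout here would correspond to the hanging connection described under Impact.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: sends 4lw commands and reports replies or failures.
class FourLetterProbe {

    // Sends one command and returns everything the server writes before closing.
    // Throws an IOException (including SocketTimeoutException) on connect/read failure.
    static String probe(String host, int port, String cmd, int timeoutMs) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs); // bound each read so a hang surfaces as a timeout
            OutputStream out = socket.getOutputStream();
            out.write(cmd.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            // Reads until the server closes the connection (EOF).
            return new String(socket.getInputStream().readAllBytes(), StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) {
        // Number of probes; defaults to 10 so the program always terminates.
        int attempts = args.length > 0 ? Integer.parseInt(args[0]) : 10;
        for (int i = 0; i < attempts; i++) {
            try {
                System.out.println(probe("localhost", 2181, "ruok", 1000));
            } catch (IOException e) {
                System.out.println("probe failed: " + e);
            }
        }
    }
}
```

Running it with a large attempt count while the server starts should approximate the shell loop. Whether it hits the race depends on timing.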

*Testing*


The issue is timing-dependent and difficult to reproduce deterministically in a 
unit test.
A simple and deterministic regression test is proposed that verifies 
removeCnxn() is safe when zkDb == null.


*Guilty stack*
{code:java}
2026-05-13 10:11:06,025 [NIOWorkerThread-1] org.apache.zookeeper.server.ServerCnxn : ERROR - [myid:] Error closing a command socket
java.lang.NullPointerException: Cannot invoke "org.apache.zookeeper.server.ZKDatabase.removeCnxn(org.apache.zookeeper.server.ServerCnxn)" because "this.zkDb" is null
        at org.apache.zookeeper.server.ZooKeeperServer.removeCnxn(ZooKeeperServer.java:333)
        at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:614)
        at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:604)
        at org.apache.zookeeper.server.ServerCnxn.cleanupWriterSocket(ServerCnxn.java:593)
        at org.apache.zookeeper.server.command.AbstractFourLetterCommand.run(AbstractFourLetterCommand.java:60)
        at org.apache.zookeeper.server.command.AbstractFourLetterCommand.start(AbstractFourLetterCommand.java:51)
        at org.apache.zookeeper.server.command.CommandExecutor.execute(CommandExecutor.java:45)
        at org.apache.zookeeper.server.NIOServerCnxn.checkFourLetterWord(NIOServerCnxn.java:548)
        at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:562)
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:352)
        at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
        at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
{code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
