[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MAJ updated ZOOKEEPER-5051:
---------------------------
    Description: 
When a 4-letter command (e.g. ruok, stat) is sent to ZooKeeper during server 
startup, the connection close path can trigger a NullPointerException in 
ZooKeeperServer.removeCnxn().

*Root cause*

 
4-letter commands are handled before full session establishment and may be 
processed while the server is still starting.
During this phase:
 - ZooKeeperServer.zkDb is not yet initialized
 - The connection close path still calls ZooKeeperServer.removeCnxn()
 - removeCnxn unconditionally dereferences zkDb, leading to NPE
 
This occurs in a race window between:
 - Accepting/processing 4lw commands
 - ZooKeeperServer initialization, specifically the startdata() call that sets zkDb.

*Impact*
 - Client connections invoking 4lw commands during startup may hang.
 - Server logs contain intermittent NPEs.
 - Connection cleanup is incomplete, leaving the socket open on the client side.

This can affect monitoring systems (ruok/stat/cons checks) that probe the 
server early during startup.

*Proposed fix*

I plan to submit a fix against master that guards 
ZooKeeperServer.removeCnxn() against a null zkDb: if zkDb is null, skip the 
removal, since no DB tracking exists yet.
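
The guard described above can be sketched roughly as follows. This is a 
simplified, self-contained model and not the actual patch: the Cnxn, Db and 
Server classes are hypothetical stand-ins for ServerCnxn, ZKDatabase and 
ZooKeeperServer, and removeCnxnUnguarded(), addCnxn() and trackedCount() exist 
only to make the before/after behaviour visible.

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in model; NOT the real ZooKeeper classes.
class RemoveCnxnGuardSketch {

    /** Hypothetical stand-in for ServerCnxn. */
    static final class Cnxn {}

    /** Hypothetical stand-in for ZKDatabase; it only tracks connections. */
    static final class Db {
        private final Set<Cnxn> cnxns = ConcurrentHashMap.newKeySet();
        void addCnxn(Cnxn c) { cnxns.add(c); }
        void removeCnxn(Cnxn c) { cnxns.remove(c); }
        int size() { return cnxns.size(); }
    }

    /** Hypothetical stand-in for ZooKeeperServer. */
    static final class Server {
        // Stays null until startdata() runs, as described in the root cause.
        private volatile Db zkDb;

        void startdata() {
            if (zkDb == null) {
                zkDb = new Db();
            }
        }

        // Current behaviour: dereferences zkDb unconditionally.
        void removeCnxnUnguarded(Cnxn c) {
            zkDb.removeCnxn(c);
        }

        // Proposed behaviour: skip removal while no DB exists yet.
        void removeCnxn(Cnxn c) {
            Db db = zkDb; // read the field once so the check and the use agree
            if (db == null) {
                return;
            }
            db.removeCnxn(c);
        }

        // Helpers for the demo only (assumes startdata() has been called).
        void addCnxn(Cnxn c) { zkDb.addCnxn(c); }
        int trackedCount() { return zkDb.size(); }
    }

    public static void main(String[] args) {
        Server server = new Server();
        Cnxn early = new Cnxn();

        try {
            server.removeCnxnUnguarded(early);
        } catch (NullPointerException e) {
            System.out.println("unguarded close before startdata(): NPE");
        }

        server.removeCnxn(early); // guarded: returns without touching zkDb
        System.out.println("guarded close before startdata(): skipped");

        server.startdata();
        server.addCnxn(early);
        server.removeCnxn(early); // normal path once the DB exists
        System.out.println("tracked after removal: " + server.trackedCount());
    }
}
{code}

Reading zkDb into a local variable before the null check is a choice made for 
this sketch; the real change may simply test the field directly.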

*Reproduction/Emulation*
 
Start ZooKeeper and simultaneously send repeated 4lw commands:
 
{code:bash}
while true; do printf ruok | nc localhost 2181; done
{code}
 
During startup, intermittent NPEs are observed. Clients without timeouts 
hang.
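
Until the server side is fixed, a probing client can protect itself by setting 
connect and read timeouts. The following is a minimal sketch of such a probe 
using only java.net.Socket; the FourLetterProbe class, the send() helper and 
the 2000 ms timeout are illustrative assumptions, not part of ZooKeeper.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the ZooKeeper code base.
class FourLetterProbe {

    /**
     * Sends a 4lw command and returns everything the server writes before
     * closing the connection. Throws SocketTimeoutException if connecting,
     * or any single blocking read, takes longer than timeoutMs.
     */
    static String send(String host, int port, String cmd, int timeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs); // bounds each blocking read

            OutputStream out = socket.getOutputStream();
            out.write(cmd.getBytes(StandardCharsets.US_ASCII));
            out.flush();

            InputStream in = socket.getInputStream();
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) { // -1 means the server closed
                reply.write(buf, 0, n);
            }
            return new String(reply.toByteArray(), StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) {
        try {
            System.out.println(send("localhost", 2181, "ruok", 2000));
        } catch (SocketTimeoutException e) {
            System.out.println("probe timed out");
        } catch (IOException e) {
            System.out.println("probe failed: " + e.getMessage());
        }
    }
}
{code}

With a read timeout in place, a probe that hits the race window fails fast 
with a SocketTimeoutException instead of blocking indefinitely.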

*Testing*

The issue is timing-dependent and difficult to reproduce deterministically in a 
unit test.
A simple and deterministic regression test is proposed that verifies 
removeCnxn() is safe when zkDb == null.

*Guilty stack*
{code:java}
2026-05-13 10:11:06,025 [NIOWorkerThread-1] org.apache.zookeeper.server.ServerCnxn : ERROR - [myid:] Error closing a command socket
java.lang.NullPointerException: Cannot invoke "org.apache.zookeeper.server.ZKDatabase.removeCnxn(org.apache.zookeeper.server.ServerCnxn)" because "this.zkDb" is null
        at org.apache.zookeeper.server.ZooKeeperServer.removeCnxn(ZooKeeperServer.java:333)
        at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:614)
        at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:604)
        at org.apache.zookeeper.server.ServerCnxn.cleanupWriterSocket(ServerCnxn.java:593)
        at org.apache.zookeeper.server.command.AbstractFourLetterCommand.run(AbstractFourLetterCommand.java:60)
        at org.apache.zookeeper.server.command.AbstractFourLetterCommand.start(AbstractFourLetterCommand.java:51)
        at org.apache.zookeeper.server.command.CommandExecutor.execute(CommandExecutor.java:45)
        at org.apache.zookeeper.server.NIOServerCnxn.checkFourLetterWord(NIOServerCnxn.java:548)
        at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:562)
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:352)
        at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
        at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
{code}
 

  was:
When a 4-letter command (e.g. ruok, stat) is sent to ZooKeeper during server 
startup, the connection close path can trigger a NullPointerException in 
ZooKeeperServer.removeCnxn().

*Root cause*

 
4-letter commands are handled before full session establishment and may be 
processed while the server is still starting.
During this phase:
 - ZooKeeperServer.zkDb is not yet initialized
 - The connection close path still calls ZooKeeperServer.removeCnxn()
 - removeCnxn unconditionally dereferences zkDb, leading to NPE
 
This occurs in a race window between:
 - Accepting/processing 4lw commands
 - ZooKeeperServer initialization, specifically startdata() call that sets it.

*Impact*
 - Client connections invoking 4lw commands during startup may hang.
 - Server logs contain intermittent NPEs.
 - Connection cleanup is incomplete, leaving socket open on client side This 
can affect monitoring systems (ruok/stat/cons checks) that probe the server 
early during startup.

*Proposed fix*

I plan to submit the fix against master by adding a guarding 
ZooKeeperServer.removeCnxn() against null zkDb:
If zkDb is null, skip removal since no DB tracking exists yet.

*Reproduction/Emulation*
 
Start ZooKeeper and simultaneously send repeated 4lw commands:
 
{code:java}
while true; do printf ruok | nc localhost 2181; done{code}
 
During startup, intermittent NPEs are observed.

*Testing*

The issue is timing-dependent and difficult to reproduce deterministically in a 
unit test.
A simple and deterministic regression test is proposed that verifies 
removeCnxn() is safe when zkDb == null.

*Guilty stack*
{code:java}
2026-05-13 10:11:06,025 [NIOWorkerThread-1] 
org.apache.zookeeper.server.ServerCnxn : ERROR - [myid:] Error closing a 
command socket 
java.lang.NullPointerException: Cannot invoke 
"org.apache.zookeeper.server.ZKDatabase.removeCnxn(org.apache.zookeeper.server.ServerCnxn)"
 because "this.zkDb" is null
        at 
org.apache.zookeeper.server.ZooKeeperServer.removeCnxn(ZooKeeperServer.java:333)
        at 
org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:614)
        at 
org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:604)
        at 
org.apache.zookeeper.server.ServerCnxn.cleanupWriterSocket(ServerCnxn.java:593)
        at 
org.apache.zookeeper.server.command.AbstractFourLetterCommand.run(AbstractFourLetterCommand.java:60)
        at 
org.apache.zookeeper.server.command.AbstractFourLetterCommand.start(AbstractFourLetterCommand.java:51)
        at 
org.apache.zookeeper.server.command.CommandExecutor.execute(CommandExecutor.java:45)
        at 
org.apache.zookeeper.server.NIOServerCnxn.checkFourLetterWord(NIOServerCnxn.java:548)
        at 
org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:562)
        at 
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:352)
        at 
org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
        at 
org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583) {code}
 


> 4lw commands during startup can trigger NPE in ZooKeeperServer.removeCnxn and 
> leave connections hanging
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-5051
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5051
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.9.5
>         Environment:  
> ZooKeeper 3.9.5
> Java 21.0.6
> NIOServerCnxn
> 4-letter-word commands during server startup
>  
>            Reporter: MAJ
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
