[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18082228#comment-18082228
 ] 

MAJ edited comment on ZOOKEEPER-5051 at 5/26/26 5:35 AM:
---------------------------------------------------------

Pull request submitted:
https://github.com/apache/zookeeper/pull/2391

CI green, ready for review.


was (Author: JIRAUSER313395):
Pull request submitted:
https://github.com/apache/zookeeper/pull/2391


> 4lw commands during startup can trigger NPE in ZooKeeperServer.removeCnxn and 
> leave connections hanging
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-5051
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5051
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.9.5
>         Environment:  
> ZooKeeper 3.9.5
> Java 21.0.6
> NIOServerCnxn
> 4-letter-word commands during server startup
>  
>            Reporter: MAJ
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When a 4-letter command (e.g. ruok, stat) is sent to ZooKeeper during server 
> startup, the connection close path can trigger a NullPointerException in 
> ZooKeeperServer.removeCnxn().
> *Root cause*
>  
> 4-letter commands are handled before full session establishment and may be 
> processed while the server is still starting.
> During this phase:
>  - ZooKeeperServer.zkDb is not yet initialized
>  - The connection close path still calls ZooKeeperServer.removeCnxn()
>  - removeCnxn unconditionally dereferences zkDb, leading to NPE
>  
> This occurs in a race window between:
>  - Accepting/processing 4lw commands
>  - ZooKeeperServer initialization, specifically the startdata() call that sets zkDb.
> *Impact*
>  - Client connections invoking 4lw commands during startup may hang.
>  - Server logs contain intermittent NPEs.
>  - Connection cleanup is incomplete, leaving the socket open on the client side.
> This can affect monitoring systems (ruok/stat/cons checks) that probe the 
> server early during startup.
> *Proposed fix*
> I plan to submit a fix against master that guards 
> ZooKeeperServer.removeCnxn() against a null zkDb: 
> if zkDb is null, skip removal, since no DB tracking exists yet.
> *Reproduction/Emulation*
>  
> Start ZooKeeper and simultaneously send repeated 4lw commands:
>  
> {code:bash}
> while true; do printf ruok | nc localhost 2181; done{code}
>  
> During startup, intermittent NPEs are observed.
> *Testing*
> The issue is timing-dependent and difficult to reproduce deterministically in 
> a unit test.
> A simple and deterministic regression test is proposed that verifies 
> removeCnxn() is safe when zkDb == null.
> *Guilty stack*
> {code:java}
> 2026-05-13 10:11:06,025 [NIOWorkerThread-1] org.apache.zookeeper.server.ServerCnxn : ERROR - [myid:] Error closing a command socket
> java.lang.NullPointerException: Cannot invoke "org.apache.zookeeper.server.ZKDatabase.removeCnxn(org.apache.zookeeper.server.ServerCnxn)" because "this.zkDb" is null
>         at org.apache.zookeeper.server.ZooKeeperServer.removeCnxn(ZooKeeperServer.java:333)
>         at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:614)
>         at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:604)
>         at org.apache.zookeeper.server.ServerCnxn.cleanupWriterSocket(ServerCnxn.java:593)
>         at org.apache.zookeeper.server.command.AbstractFourLetterCommand.run(AbstractFourLetterCommand.java:60)
>         at org.apache.zookeeper.server.command.AbstractFourLetterCommand.start(AbstractFourLetterCommand.java:51)
>         at org.apache.zookeeper.server.command.CommandExecutor.execute(CommandExecutor.java:45)
>         at org.apache.zookeeper.server.NIOServerCnxn.checkFourLetterWord(NIOServerCnxn.java:548)
>         at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:562)
>         at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:352)
>         at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
>         at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>         at java.base/java.lang.Thread.run(Thread.java:1583)
> {code}
>  
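
For readers who want to see the shape of the guard and the regression check described under "Proposed fix" and "Testing" in the quoted description, here is a minimal sketch. It is not the code from the pull request. SketchServer, SketchDatabase and SketchCnxn are hypothetical stand-ins for ZooKeeperServer, ZKDatabase and ServerCnxn, and the volatile field and the startData() name are assumptions of the sketch, not a statement about the real class.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for ServerCnxn; the real class is much larger.
final class SketchCnxn {}

// Hypothetical stand-in for ZKDatabase: only tracks connections.
final class SketchDatabase {
    private final Set<SketchCnxn> tracked = new HashSet<>();

    void addCnxn(SketchCnxn cnxn) { tracked.add(cnxn); }
    void removeCnxn(SketchCnxn cnxn) { tracked.remove(cnxn); }
    int trackedCount() { return tracked.size(); }
}

// Hypothetical stand-in for ZooKeeperServer.
final class SketchServer {
    // Stays null until startData() runs, mirroring the startup window in the issue.
    // Declaring it volatile is an assumption made for this sketch only.
    private volatile SketchDatabase zkDb;

    void startData() { zkDb = new SketchDatabase(); }

    SketchDatabase getDb() { return zkDb; }

    void removeCnxn(SketchCnxn cnxn) {
        // Read the field once so the null check and the call use the same reference.
        SketchDatabase db = zkDb;
        if (db == null) {
            return; // nothing is tracked before the database exists
        }
        db.removeCnxn(cnxn);
    }
}

final class RemoveCnxnGuardSketch {
    public static void main(String[] args) {
        SketchServer server = new SketchServer();
        SketchCnxn cnxn = new SketchCnxn();

        // Regression check: closing a connection before startup finishes must not throw.
        server.removeCnxn(cnxn);

        // After initialization, removal behaves as usual.
        server.startData();
        server.getDb().addCnxn(cnxn);
        server.removeCnxn(cnxn);
        System.out.println("tracked after removal: " + server.getDb().trackedCount());
    }
}
```

The first removeCnxn() call in main is the deterministic part: it exercises the null path without depending on startup timing.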

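The hang described under "Impact" in the quoted description is easier to spot with a probe that bounds its reads than with a bare nc loop. The sketch below is one possible client, assuming the server listens on localhost:2181 as in the reproduction. The class name, method name and the 2000 ms timeout are hypothetical choices for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

final class FourLetterProbeSketch {

    /**
     * Sends a four-letter command and reads until the server closes the connection.
     * Throws SocketTimeoutException if connecting or any single read exceeds timeoutMs.
     */
    static String send(String host, int port, String cmd, int timeoutMs) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs); // bound every blocking read
            socket.getOutputStream().write(cmd.getBytes(StandardCharsets.US_ASCII));
            socket.getOutputStream().flush();

            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[1024];
            int n;
            // A healthy server answers and then closes, which ends this loop with -1.
            while ((n = in.read(buf)) != -1) {
                reply.write(buf, 0, n);
            }
            return reply.toString(StandardCharsets.US_ASCII.name());
        }
    }

    public static void main(String[] args) {
        try {
            System.out.println(send("localhost", 2181, "ruok", 2000));
        } catch (SocketTimeoutException e) {
            // The connection was accepted but never answered or never closed in time.
            System.out.println("no reply or no close within the timeout");
        } catch (IOException e) {
            System.out.println("probe failed: " + e);
        }
    }
}
```

Running main in a loop while the server starts should make a connection that is left open show up as a timeout message instead of a silently blocked client.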


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
