[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-5011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ZOOKEEPER-5011:
--------------------------------------
    Labels: pull-request-available  (was: )

> FileTxnSnapLog throws uninformative NPE when accessed via stale reference 
> after server restart
> ----------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-5011
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5011
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.9.4
>            Reporter: ConfX
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In scenarios involving server restarts or failovers, application code may 
> inadvertently hold a stale reference to a `ZooKeeperServer` instance whose 
> `FileTxnSnapLog` has been closed. When methods like `save()` or `restore()` 
> are called through such stale references, the current implementation throws a 
> raw `NullPointerException` with no message, making it extremely difficult to 
> diagnose the root cause.
>  
> The current behavior forces developers to trace through the code to 
> understand that the NPE is caused by the resource being closed, rather than 
> immediately indicating the problem.
> h2. To reproduce:
> 1. Application code obtains a reference to `ZooKeeperServer` (e.g., `server = 
> factory.getZooKeeperServer()`)
> 2. Server undergoes a restart (graceful shutdown + new instance creation)
> 3. The old server's `FileTxnSnapLog.close()` is called during shutdown, 
> setting internal fields to `null`
> 4. Application code still holds the stale reference to the old server
> 5. When `server.takeSnapshot()` is called, it invokes `FileTxnSnapLog.save()` 
> on the closed instance
> 6. Result: Uninformative `NullPointerException` with no indication of the 
> actual problem
> {code:java}
> java.lang.NullPointerException
>     at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:482)
>     at 
> org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:575)
>     ... {code}
> When `close()` is called, it sets internal fields to `null` without any 
> tracking:
> {code:java}
> // FileTxnSnapLog.java:623-634
> public void close() throws IOException {
>     TxnLog txnLogToClose = txnLog;
>     if (txnLogToClose != null) {
>         txnLogToClose.close();
>     }
>     txnLog = null;  // Set to null, no closed flag
>     SnapShot snapSlogToClose = snapLog;
>     if (snapSlogToClose != null) {
>         snapSlogToClose.close();
>     }
>     snapLog = null;  // Set to null, no closed flag
> }{code}
>  
> Subsequently, methods like `save()` access these fields without null checks:
> {code:java}
> // FileTxnSnapLog.java:482
> snapLog.serialize(dataTree, sessionsWithTimeouts, snapshotFile, syncSnap);
> // NPE thrown here if snapLog is null {code}
> h2. Proposed Apporach:
> When a closed `FileTxnSnapLog` is accessed (regardless of the reason), it 
> should throw an `IllegalStateException` with a clear message:
> {code:java}
> java.lang.IllegalStateException: FileTxnSnapLog has been closed
>     at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:XXX)
>     ... {code}
> So we can add a `closed` flag and a helper method to provide informative 
> error messages:
> {code:java}
>     // ... existing fields ...
>     private volatile boolean closed = false;    /**
>      * Throws IllegalStateException if this FileTxnSnapLog has been closed.
>      * This helps diagnose issues with stale references after server restarts.
>      */
>     private void checkNotClosed() {
>         if (closed) {
>             throw new IllegalStateException(
>                 "FileTxnSnapLog has been closed. " +
>                 "This may indicate a stale reference to a stopped server 
> instance.");
>         }
>     }    
>     public void close() throws IOException {
>         if (closed) {
>             return;  // Already closed, make idempotent
>         }
>         closed = true;        TxnLog txnLogToClose = txnLog;
>         if (txnLogToClose != null) {
>             txnLogToClose.close();
>         }
>         txnLog = null;
>         SnapShot snapSlogToClose = snapLog;
>         if (snapSlogToClose != null) {
>             snapSlogToClose.close();
>         }
>         snapLog = null;
>     }    
>     public File save(
>         DataTree dataTree,
>         ConcurrentHashMap<Long, Integer> sessionsWithTimeouts,
>         boolean syncSnap) throws IOException {
>         checkNotClosed();  // Add this check
>         long lastZxid = dataTree.lastProcessedZxid;
>         // ... rest of method unchanged
>     }    // Similar checkNotClosed() calls added to other public methods 
> {code}
> I'm happy to send a PR for this issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to