Re: [PR] fix(cassandra): auto-recover session after Cassandra restart [hugegraph]

via GitHub Tue, 21 Apr 2026 00:22:59 -0700


imbajin commented on code in PR #2997:
URL: https://github.com/apache/hugegraph/pull/2997#discussion_r3115720594



##########
hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java:
##########
@@ -86,6 +115,14 @@ public synchronized void open() {
 
         builder.withSocketOptions(socketOptions);
 
+        // Reconnection policy: let driver keep retrying nodes in background
+        // with exponential backoff after they go down (see issue #2740).
+        long reconnectBase = config.get(
+                CassandraOptions.CASSANDRA_RECONNECT_BASE_DELAY);

Review Comment:
   ⚠️ **Duplicate config read for `reconnect_base_delay`**
   
   `CASSANDRA_RECONNECT_BASE_DELAY` is read twice from config: once in the 
constructor (line 72) to validate the `base ≤ max` constraint, and again here 
in `open()` (line 120-121) to build the reconnection policy.
   
   The constructor already stores `reconnectMax` as `this.retryMaxDelay`, but 
`reconnectBase` is discarded after validation. This means:
   1. `config.get()` is called redundantly on every `open()`.
   2. If the config object were mutable (or in tests with different config 
instances), the two reads could theoretically diverge.
   
   **Suggestion:** store `reconnectBase` as a field alongside `retryMaxDelay` 
in the constructor, then use `this.retryBaseDelay` here:
   
   ```java
   // constructor
   this.retryBaseDelay = reconnectBase;   // add this field
   this.retryMaxDelay  = reconnectMax;
   
   // open()
   builder.withReconnectionPolicy(
           new ExponentialReconnectionPolicy(this.retryBaseDelay,
                                             this.retryMaxDelay));
   ```
   



##########
hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java:
##########
@@ -255,6 +355,61 @@ public boolean hasChanges() {
             return this.batch.size() > 0;
         }
 
+        /**
+         * Periodic liveness probe invoked by {@link BackendSessionPool} to
+         * recover thread-local sessions after Cassandra has been restarted.
+         * Reopens the driver session if it was closed and pings the cluster
+         * with a lightweight query. Any failure here is swallowed so the
+         * caller can still issue the real query, which will drive retries via
+         * {@link #executeWithRetry(Statement)}.
+         */
+        @Override
+        public void reconnectIfNeeded() {
+            if (!this.opened) {
+                return;
+            }
+            try {
+                if (this.session == null || this.session.isClosed()) {
+                    this.session = null;
+                    this.tryOpen();
+                }
+                if (this.session != null) {
+                    this.session.execute(new 
SimpleStatement(HEALTH_CHECK_CQL));
+                }
+            } catch (DriverException e) {
+                LOG.debug("Cassandra health-check failed, " +
+                          "will retry on next query: {}", e.getMessage());
+            } finally {
+                // Keep opened flag consistent with session: if tryOpen()
+                // failed to reopen, clear opened so the next execute() does
+                // not NPE before executeWithRetry() can intercept.
+                if (this.session == null) {
+                    this.opened = false;

Review Comment:
   ‼️ **Mutating `this.opened` in `reconnectIfNeeded()` breaks parent-class 
lifecycle invariants**
   
   `opened` is a field of `AbstractBackendSession` that tracks the ref-counted 
lifecycle managed by `BackendSessionPool`. Setting it to `false` here bypasses 
that contract.
   
   The normal teardown flow is:
   ```
   BackendSessionPool.closeSession()
     → session.detach()         // decrements ref
     → if ref == 0: session.close()
     → threadLocalSession.remove()
     → sessions.remove(threadId)
     → sessionCount.decrementAndGet()
   ```
   
   If `this.opened = false` is set here instead, the session object stays 
registered in `sessions` and `sessionCount` is never decremented. On the next 
`getOrNewSession()` call the pool sees `opened == false` and creates a new 
session, leaking the old entry.
   
   Additionally, `opened()` already guards against a null session:
   ```java
   @Override
   public boolean opened() {
       if (this.opened && this.session == null) {
           this.tryOpen();           // already handles the re-open
       }
       return this.opened && this.session != null;
   }
   ```
   So the NPE concern in the `finally` comment is already handled by 
`opened()`. The `finally` block can be removed entirely — just catch the 
exception and log, letting the normal `opened()` / `executeWithRetry()` path 
handle the rest.
   



##########
hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java:
##########
@@ -174,6 +211,7 @@ public void commitAsync() {
             int processors = Math.min(statements.size(), 1023);
             List<ResultSetFuture> results = new ArrayList<>(processors + 1);
             for (Statement s : statements) {
+                // TODO: commitAsync is not retried (async retry semantics are 
complex)

Review Comment:
   ‼️ **`commitAsync()` has no retry protection — inconsistent write 
reliability**
   
   `commit()` (sync path) wraps all statements in `executeWithRetry`, but 
`commitAsync()` calls `this.session.executeAsync(s)` directly. This means write 
operations on the async path get no protection against transient 
`NoHostAvailableException` / `OperationTimedOutException` — exactly the 
failures this PR targets.
   
   Timeline during a Cassandra restart:
   
   ```
   Thread A (sync):   commit() → executeWithRetry() → retries → succeeds ✓
   Thread B (async):  commitAsync() → executeAsync() → NoHostAvailableException 
✗
   ```
   
   Both paths commit the same batch type; users cannot know which one they're 
calling or whether it will be protected.
   
   If implementing async retry is genuinely deferred, the TODO should be more 
explicit about the consequence — callers may see write failures during a 
Cassandra restart even with `maxRetries > 0` configured. Consider at minimum 
logging a warning at startup when `maxRetries > 0`.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix(cassandra): auto-recover session after Cassandra restart [hugegraph]

Reply via email to