imbajin commented on code in PR #2997:
URL: https://github.com/apache/hugegraph/pull/2997#discussion_r3105727865


##########
hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java:
##########
@@ -197,15 +234,59 @@ public ResultSet query(Statement statement) {
         }
 
         public ResultSet execute(Statement statement) {
-            return this.session.execute(statement);
+            return this.executeWithRetry(statement);
         }
 
         public ResultSet execute(String statement) {
-            return this.session.execute(statement);
+            return this.executeWithRetry(new SimpleStatement(statement));
         }
 
         public ResultSet execute(String statement, Object... args) {
-            return this.session.execute(statement, args);
+            return this.executeWithRetry(new SimpleStatement(statement, args));
+        }
+
+        /**
+         * Execute a statement, retrying on transient connectivity failures
+         * (NoHostAvailableException / OperationTimedOutException). The driver
+         * itself keeps retrying connections in the background via the
+         * reconnection policy, so once Cassandra comes back online, a
+         * subsequent attempt here will succeed without restarting the server.
+         * See issue #2740.
+         */
+        private ResultSet executeWithRetry(Statement statement) {
+            int retries = CassandraSessionPool.this.maxRetries;

Review Comment:
   ‼️ **Critical: BackendException argument order is wrong — `lastError` is 
passed as the first vararg instead of as the `cause`**
   
   The constructor signature being called here is `BackendException(String 
message, Object... args)`, which means `lastError` is used as a format argument 
(`%s`), **not** as the exception cause. The original exception's stack trace is 
lost entirely.
   
   The format string has two `%s` placeholders but the args are `(lastError, 
retries, message)` — three args. The first `%s` will be filled by 
`lastError.toString()` (the exception object), the second `%s` by `retries` (an 
int). The `message` string becomes a dangling unused arg.
   
   To preserve the cause chain **and** format the message correctly, use the 
`(String, Throwable, Object...)` constructor:
   
   ```suggestion
               throw new BackendException("Failed to execute Cassandra query " +
                                          "after %s retries: %s",
                                          lastError, retries,
                                          lastError == null ? "<null>" :
                                                  lastError.getMessage());
   ```
   
   Should be:
   
   ```java
   throw new BackendException(
       "Failed to execute Cassandra query after %s retries: %s",
       lastError, retries,
       lastError == null ? "<null>" : lastError.getMessage());
   ```
   
   Wait — looking more carefully at the available constructors:
   - `BackendException(String message, Throwable cause, Object... args)` — this 
is what you need.
   
   The fix:
   ```java
   throw new BackendException(
       "Failed to execute Cassandra query after %s retries",
       lastError, retries);
   ```
   
   This passes `lastError` as the `cause` (Throwable) and `retries` as the 
format arg. The cause's message is already accessible via 
`getCause().getMessage()`, so repeating it in the format string is unnecessary.



##########
hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java:
##########
@@ -255,6 +336,53 @@ public boolean hasChanges() {
             return this.batch.size() > 0;
         }
 
+        /**
+         * Periodic liveness probe invoked by {@link BackendSessionPool} to
+         * recover thread-local sessions after Cassandra has been restarted.
+         * Reopens the driver session if it was closed and pings the cluster
+         * with a lightweight query. Any failure here is swallowed so the
+         * caller can still issue the real query, which will drive retries via
+         * {@link #executeWithRetry(Statement)}.
+         */

Review Comment:
   ⚠️ **`reconnectIfNeeded()` checks `this.opened` but never sets it to `false` 
on failure**
   
   When the health check (`HEALTH_CHECK_CQL`) fails, the session stays in 
`opened = true` state. On the next call to `detectSession()` from 
`BackendSessionPool`, the idle-time check (`now - session.updated() > 
interval`) may skip `reconnectIfNeeded()` entirely because `session.update()` 
was called right after.
   
   This means if the health check catches a failure, the session is marked as 
"recently updated" but is actually broken — and `reconnectIfNeeded()` won't be 
called again until the detect interval elapses.
   
   Also: if `tryOpen()` fails inside `reconnectIfNeeded()`, the method silently 
returns. But `this.opened` was already `true` (the guard at the top), and 
`this.session` is now `null`. Any subsequent `execute()` call will NPE on 
`this.session.execute(...)` before `executeWithRetry` can even catch a 
`DriverException`.
   
   Consider setting `this.opened = false` when the session is nulled out, so 
that `tryOpen()` is properly re-triggered on the next access.



##########
hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java:
##########
@@ -197,15 +234,59 @@ public ResultSet query(Statement statement) {
         }
 
         public ResultSet execute(Statement statement) {
-            return this.session.execute(statement);
+            return this.executeWithRetry(statement);
         }
 
         public ResultSet execute(String statement) {
-            return this.session.execute(statement);
+            return this.executeWithRetry(new SimpleStatement(statement));
         }
 
         public ResultSet execute(String statement, Object... args) {
-            return this.session.execute(statement, args);
+            return this.executeWithRetry(new SimpleStatement(statement, args));
+        }
+
+        /**

Review Comment:
   🧹 **Minor: `Thread.sleep()` in `executeWithRetry` blocks the calling thread**
   
   This is acceptable for low-QPS scenarios, but for high-throughput workloads 
with many concurrent queries hitting a downed Cassandra, all request threads 
will pile up sleeping here (up to 10 retries × 60s max delay = 10 minutes worst 
case per thread).
   
   Consider whether the retry count and max delay defaults might be too 
aggressive for production. A thread blocked for minutes can cascade into thread 
pool exhaustion. An alternative would be to fail faster (e.g., 3 retries with 
shorter delays) and let the caller/user retry at a higher level.
   
   Not a blocker — just worth considering for the default values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to