imbajin commented on code in PR #2997:
URL: https://github.com/apache/hugegraph/pull/2997#discussion_r3115721646
##########
hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java:
##########
@@ -197,15 +235,77 @@ public ResultSet query(Statement statement) {
}
public ResultSet execute(Statement statement) {
- return this.session.execute(statement);
+ return this.executeWithRetry(statement);
}
public ResultSet execute(String statement) {
- return this.session.execute(statement);
+ return this.executeWithRetry(new SimpleStatement(statement));
}
public ResultSet execute(String statement, Object... args) {
- return this.session.execute(statement, args);
+ return this.executeWithRetry(new SimpleStatement(statement, args));
+ }
+
+ /**
+ * Execute a statement, retrying on transient connectivity failures
+ * (NoHostAvailableException / OperationTimedOutException). The driver
+ * itself keeps retrying connections in the background via the
+ * reconnection policy, so once Cassandra comes back online, a
+ * subsequent attempt here will succeed without restarting the server.
+ *
+ * <p><b>Blocking note:</b> retries block the calling thread via
+ * {@link Thread#sleep(long)}. Worst-case a single call blocks for
+ * {@code maxRetries * retryMaxDelay} ms. Under high-throughput
+ * workloads concurrent threads may pile up in {@code sleep()} during
+ * a Cassandra outage. For such deployments lower
+ * {@code cassandra.reconnect_max_retries} (default 10) and
+ * {@code cassandra.reconnect_max_delay} (default 60000ms) so the
+ * request fails fast and pressure is released back to the caller.
+ */
+ private ResultSet executeWithRetry(Statement statement) {
Review Comment:
⚠️ **Default settings allow a single query call to block for up to ~7
minutes**
With the defaults `maxRetries = 10`, `reconnect_interval = 5000 ms`, and
`reconnect_max_delay = 60 000 ms`, the worst-case wall time for one
`executeWithRetry()` call is:
```
attempt 0 → sleep(min(5000 * 1, 60000)) = 5 000 ms
attempt 1 → sleep(min(5000 * 2, 60000)) = 10 000 ms
attempt 2 → sleep(min(5000 * 4, 60000)) = 20 000 ms
attempt 3 → sleep(min(5000 * 8, 60000)) = 40 000 ms
attempts 4–9 → sleep(60 000 ms each) = 360 000 ms
──────────
total blocking time ≈ 435 seconds (~7 min)
```
During a Cassandra outage every in-flight thread sleeps through this
sequence. On a busy server, hundreds of threads pile up in `Thread.sleep()`,
exhausting the thread pool well before `maxRetries` is hit.
Consider lowering the defaults so the query-level retry is a short hiccup
buffer, while the driver-level `ExponentialReconnectionPolicy` handles the
actual node reconnection in the background:
| Option | Current default | Suggested |
|--------|----------------|-----------|
| `reconnect_max_retries` | 10 | 3 |
| `reconnect_interval` | 5 000 ms | 1 000 ms |
| `reconnect_max_delay` | 60 000 ms | 10 000 ms |
##########
hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java:
##########
@@ -255,6 +355,61 @@ public boolean hasChanges() {
return this.batch.size() > 0;
}
+ /**
+ * Periodic liveness probe invoked by {@link BackendSessionPool} to
+ * recover thread-local sessions after Cassandra has been restarted.
+ * Reopens the driver session if it was closed and pings the cluster
+ * with a lightweight query. Any failure here is swallowed so the
+ * caller can still issue the real query, which will drive retries via
+ * {@link #executeWithRetry(Statement)}.
+ */
+ @Override
+ public void reconnectIfNeeded() {
+ if (!this.opened) {
+ return;
+ }
+ try {
+ if (this.session == null || this.session.isClosed()) {
+ this.session = null;
+ this.tryOpen();
+ }
+ if (this.session != null) {
+ this.session.execute(new
SimpleStatement(HEALTH_CHECK_CQL));
Review Comment:
⚠️ **Health-check failure silently leaves the dead session in place**
When `session.execute(HEALTH_CHECK_CQL)` throws a `DriverException`, the
exception is caught and logged at DEBUG — but the `session` object that
produced the failure is kept. The probe's purpose is to detect a dead session
early; if it fails the session should be marked unhealthy immediately.
Current flow:
```
health-check → DriverException
→ LOG.debug(...)
→ session is NOT cleared
→ next executeWithRetry() tries the same dead session
→ pays the full retry cost anyway
```
Desired flow:
```
health-check → DriverException
→ session = null ← mark unhealthy now
→ opened() will call tryOpen() before the next real query
```
Suggested fix:
```java
} catch (DriverException e) {
LOG.debug("Cassandra health-check failed, resetting session: {}",
e.getMessage());
this.session = null; // force re-open on next query via opened()
}
```
This also makes the `finally` block that sets `this.opened = false`
unnecessary (see separate comment on that).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]