imbajin commented on code in PR #2997:
URL: https://github.com/apache/hugegraph/pull/2997#discussion_r3115720594
##########
hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java:
##########
@@ -86,6 +115,14 @@ public synchronized void open() {
builder.withSocketOptions(socketOptions);
+ // Reconnection policy: let driver keep retrying nodes in background
+ // with exponential backoff after they go down (see issue #2740).
+ long reconnectBase = config.get(
+ CassandraOptions.CASSANDRA_RECONNECT_BASE_DELAY);
Review Comment:
⚠️ **Duplicate config read for `reconnect_base_delay`**
`CASSANDRA_RECONNECT_BASE_DELAY` is read twice from config: once in the
constructor (line 72) to validate the `base ≤ max` constraint, and again here
in `open()` (line 120-121) to build the reconnection policy.
The constructor already stores `reconnectMax` as `this.retryMaxDelay`, but
`reconnectBase` is discarded after validation. This means:
1. `config.get()` is called redundantly on every `open()`.
2. If the config object were mutable (or in tests with different config
instances), the two reads could theoretically diverge.
**Suggestion:** store `reconnectBase` as a field alongside `retryMaxDelay`
in the constructor, then use `this.retryBaseDelay` here:
```java
// constructor
this.retryBaseDelay = reconnectBase; // add this field
this.retryMaxDelay = reconnectMax;
// open()
builder.withReconnectionPolicy(
new ExponentialReconnectionPolicy(this.retryBaseDelay,
this.retryMaxDelay));
```
##########
hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java:
##########
@@ -255,6 +355,61 @@ public boolean hasChanges() {
return this.batch.size() > 0;
}
+ /**
+ * Periodic liveness probe invoked by {@link BackendSessionPool} to
+ * recover thread-local sessions after Cassandra has been restarted.
+ * Reopens the driver session if it was closed and pings the cluster
+ * with a lightweight query. Any failure here is swallowed so the
+ * caller can still issue the real query, which will drive retries via
+ * {@link #executeWithRetry(Statement)}.
+ */
+ @Override
+ public void reconnectIfNeeded() {
+ if (!this.opened) {
+ return;
+ }
+ try {
+ if (this.session == null || this.session.isClosed()) {
+ this.session = null;
+ this.tryOpen();
+ }
+ if (this.session != null) {
+ this.session.execute(new
SimpleStatement(HEALTH_CHECK_CQL));
+ }
+ } catch (DriverException e) {
+ LOG.debug("Cassandra health-check failed, " +
+ "will retry on next query: {}", e.getMessage());
+ } finally {
+ // Keep opened flag consistent with session: if tryOpen()
+ // failed to reopen, clear opened so the next execute() does
+ // not NPE before executeWithRetry() can intercept.
+ if (this.session == null) {
+ this.opened = false;
Review Comment:
‼️ **Mutating `this.opened` in `reconnectIfNeeded()` breaks parent-class
lifecycle invariants**
`opened` is a field of `AbstractBackendSession` that tracks the ref-counted
lifecycle managed by `BackendSessionPool`. Setting it to `false` here bypasses
that contract.
The normal teardown flow is:
```
BackendSessionPool.closeSession()
→ session.detach() // decrements ref
→ if ref == 0: session.close()
→ threadLocalSession.remove()
→ sessions.remove(threadId)
→ sessionCount.decrementAndGet()
```
If `this.opened = false` is set here instead, the session object stays
registered in `sessions` and `sessionCount` is never decremented. On the next
`getOrNewSession()` call the pool sees `opened == false` and creates a new
session, leaking the old entry.
Additionally, `opened()` already guards against a null session:
```java
@Override
public boolean opened() {
if (this.opened && this.session == null) {
this.tryOpen(); // already handles the re-open
}
return this.opened && this.session != null;
}
```
So the NPE concern in the `finally` comment is already handled by
`opened()`. The `finally` block can be removed entirely — just catch the
exception and log, letting the normal `opened()` / `executeWithRetry()` path
handle the rest.
##########
hugegraph-server/hugegraph-cassandra/src/main/java/org/apache/hugegraph/backend/store/cassandra/CassandraSessionPool.java:
##########
@@ -174,6 +211,7 @@ public void commitAsync() {
int processors = Math.min(statements.size(), 1023);
List<ResultSetFuture> results = new ArrayList<>(processors + 1);
for (Statement s : statements) {
+ // TODO: commitAsync is not retried (async retry semantics are
complex)
Review Comment:
‼️ **`commitAsync()` has no retry protection — inconsistent write
reliability**
`commit()` (sync path) wraps all statements in `executeWithRetry`, but
`commitAsync()` calls `this.session.executeAsync(s)` directly. This means write
operations on the async path get no protection against transient
`NoHostAvailableException` / `OperationTimedOutException` — exactly the
failures this PR targets.
Timeline during a Cassandra restart:
```
Thread A (sync): commit() → executeWithRetry() → retries → succeeds ✓
Thread B (async): commitAsync() → executeAsync() → NoHostAvailableException
✗
```
Both paths commit the same batch type; users cannot know which one they're
calling or whether it will be protected.
If implementing async retry is genuinely deferred, the TODO should be more
explicit about the consequence — callers may see write failures during a
Cassandra restart even with `maxRetries > 0` configured. Consider at minimum
logging a warning at startup when `maxRetries > 0`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]