CRZbulabula opened a new pull request, #17898:
URL: https://github.com/apache/iotdb/pull/17898
## Background
Follow-up to #17821 ("Improve ConfigNode leader warm-up before serving"),
addressing a review comment from @caideyipi.
## Problem
In `ConfigRegionStateMachine.becomeLeader()`, the parallel leader-service
startups are joined with a barrier:
```java
CompletableFuture.allOf(startups).join();
```
Each startup is wrapped by `startInParallelIfEpochCurrent()`, whose Javadoc
claims the returned future *"always completes normally so
`CompletableFuture#allOf` acts as a clean join barrier."* But the body ran
`startup.run()` **without any exception guard**:
```java
return CompletableFuture.runAsync(
() -> {
if (isCurrentLeaderServicesEpoch(epoch)) {
startup.run(); // unguarded
}
},
leaderServicesStartupPool);
```
If any of the parallel startups — CQ scheduler, pipe meta-sync/heartbeat,
subscription meta-sync, metrics, cluster-id check, etc. — throws a
`RuntimeException`, the corresponding future completes **exceptionally**.
`allOf(startups).join()` then rethrows it as a `CompletionException`, which
propagates out of `becomeLeader()` **before**
`ProcedureManager.startExecutor()` and
`markLeaderServicesReadyIfEpochCurrent()` run.
As a result `leaderServicesReady` is never set to `true`, and the ConfigNode
keeps returning `CONFIG_NODE_LEADER_WARMING_UP` to clients **indefinitely** —
the leader silently never becomes serviceable. The documented "always completes
normally" invariant was simply false.
## Fix
Make the invariant true by catching and logging each startup's failure
inside `startInParallelIfEpochCurrent()` so it can never escape into the join
barrier:
- Each startup now runs inside a `try/catch`; a failure is logged at `ERROR`
and swallowed, so a single misbehaving service can no longer abort the whole
transition.
- A failed service stays unstarted until the next leadership transition, but
the node still finishes warming up and begins serving — and the failure is
observable via the error log instead of a silent hang.
- To make that log actionable, each startup is paired with a human-readable
name through a small `LeaderServiceStartup` holder, so the log identifies
exactly which service failed.
- The `startInParallelIfEpochCurrent()` Javadoc is updated to describe the
actual (now-correct) behavior.
## Behavior change
| | Before | After |
|---|---|---|
| One parallel startup throws | `becomeLeader()` aborts; node stuck at
`CONFIG_NODE_LEADER_WARMING_UP` forever | Failure logged with the service name;
warm-up completes; node serves |
| Other startups | Their outcome irrelevant — transition already aborted |
Continue and complete normally |
## Testing
- `mvn spotless:apply -pl iotdb-core/confignode` — clean.
- `mvn compile -pl iotdb-core/confignode` — compiles successfully.
The change is a self-contained defensive guard around existing private
startup logic; no public surface or existing tests are affected.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]