CRZbulabula opened a new pull request, #17894:
URL: https://github.com/apache/iotdb/pull/17894
## Problem
While benchmarking concurrent read/write and scaling in one healthy DataNode
on a Ratis (3-replica) schema region, killing (`kill -9`) the DataNode that
hosts an **ADDING** peer mid-migration leaves the schema-region `AddRegionPeer`
task **stuck forever** (observed: `CHECK_ADD_REGION_PEER` / `WAITING` for 50+
minutes).
## Root cause
Ratis reconfiguration (add/remove peer) requires every new staging peer to
*catch up* before the leader can commit the new configuration. If the ADDING
peer is killed and can never catch up:
1. On the leader, the reconfiguration stays in the staging state and each
`setConfiguration` attempt is rejected with
`ReconfigurationInProgressException`.
2. The Ratis client used for reconfiguration uses an **unbounded** retry
policy (`retryForeverWithSleep(2s)` in the former `RatisEndlessRetryPolicy`),
so it retries forever.
3. The coordinator DataNode's `AddRegionPeerTask.run()` is a plain
synchronous `Runnable` blocked inside `addRemotePeer → setConfiguration`, with
no cancel checkpoint. It therefore never returns and the task status stays
`PROCESSING` indefinitely.
Because the task never leaves `PROCESSING`, the existing **CANCEL ALL
MIGRATIONS** mechanism (cooperative, only effective at a safe point) cannot
recover this case either — there is no safe point to stop at.
## Fix
Bound the reconfiguration retries instead of retrying forever. The former
`RatisEndlessRetryPolicy` now uses:
```java
RetryPolicies.retryUpToMaximumCountWithFixedSleep(
config.getReconfigurationMaxRetryAttempts(), // default 600
TimeDuration.valueOf(2, TimeUnit.SECONDS));
```
After the bound is exhausted, the last failure propagates
(`ConsensusException`) so the upper layer (`AddRegionPeerProcedure`) can **fail
and roll back** the migration — and `CANCEL` becomes reachable again, since the
DataNode task finally leaves `PROCESSING`.
The class/factory were renamed `EndlessRetryFactory` /
`RatisEndlessRetryPolicy` → `ReconfigurationRetryFactory` /
`RatisReconfigurationRetryPolicy`, since the policy is no longer endless.
## New configuration
Per-consensus-group, pushed from ConfigNode to DataNode via `TRatisConfig`:
| property | default |
| --- | --- |
| `config_node_ratis_reconfiguration_max_retry_attempts` | 600 |
| `schema_region_ratis_reconfiguration_max_retry_attempts` | 600 |
| `data_region_ratis_reconfiguration_max_retry_attempts` | 600 |
Reconfiguration retries use a fixed 2s interval, so the bound roughly caps
the wait at `attempts × 2s` (600 ≈ 20 min). Tune lower for faster failover,
higher for very large regions.
## Compatibility (rolling upgrade)
The two new `TRatisConfig` fields are `optional`. A ConfigNode running the
old version will not set them, in which case the DataNode keeps its **local
default** instead of overwriting it with `0` (guarded by `isSet...()` in
`IoTDBDescriptor`).
## Verification
- `mvn compile` passes for `iotdb-core/consensus`, `iotdb-core/confignode`,
`iotdb-core/datanode` (thrift regenerated).
- `spotless:check` is clean.
## Note for reviewers
- The default `600` (~20 min) favors not regressing legitimate slow
`addPeer` (large snapshot transfer) over fast failover. Happy to adjust —
schema-region snapshots are small, so a smaller `schema_region_*` default (e.g.
30–60) may be preferable.
- An integration test reproducing the killed-ADDING-peer scenario can be
added as a follow-up if desired.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]