CRZbulabula opened a new pull request, #17895:
URL: https://github.com/apache/iotdb/pull/17895

   Replaces #17894 so the PR is opened from the apache/iotdb repository branch 
instead of a personal fork.
   
   ## Problem
   
   While benchmarking concurrent read/write and scaling in one healthy DataNode 
on a Ratis (3-replica) schema region, killing (`kill -9`) the DataNode that 
hosts an **ADDING** peer mid-migration leaves the schema-region `AddRegionPeer` 
task **stuck forever** (observed: `CHECK_ADD_REGION_PEER` / `WAITING` for 50+ 
minutes).
   
   ## Root cause
   
   Ratis reconfiguration (add/remove peer) requires every new staging peer to 
*catch up* before the leader can commit the new configuration. If the ADDING 
peer is killed and can never catch up:
   
   1. On the leader, the reconfiguration stays in the staging state and each 
`setConfiguration` attempt is rejected with 
`ReconfigurationInProgressException`.
   2. The Ratis client used for reconfiguration uses an **unbounded** retry 
policy (`retryForeverWithSleep(2s)` in the former `RatisEndlessRetryPolicy`), 
so it retries forever.
   3. The coordinator DataNode's `AddRegionPeerTask.run()` is a plain 
synchronous `Runnable` blocked inside `addRemotePeer -> setConfiguration`, with 
no cancel checkpoint. It therefore never returns and the task status stays 
`PROCESSING` indefinitely.
   
   Because the task never leaves `PROCESSING`, the existing **CANCEL ALL 
MIGRATIONS** mechanism (cooperative, only effective at a safe point) cannot 
recover this case either - there is no safe point to stop at.
   
   ## Fix
   
   Bound the reconfiguration retries instead of retrying forever. The former 
`RatisEndlessRetryPolicy` now uses:
   
   ```java
   RetryPolicies.retryUpToMaximumCountWithFixedSleep(
       config.getReconfigurationMaxRetryAttempts(),   // default 600
       TimeDuration.valueOf(2, TimeUnit.SECONDS));
   ```
   
   After the bound is exhausted, the last failure propagates 
(`ConsensusException`) so the upper layer (`AddRegionPeerProcedure`) can **fail 
and roll back** the migration - and `CANCEL` becomes reachable again, since the 
DataNode task finally leaves `PROCESSING`.
   
   The class/factory were renamed `EndlessRetryFactory` / 
`RatisEndlessRetryPolicy` -> `ReconfigurationRetryFactory` / 
`RatisReconfigurationRetryPolicy`, since the policy is no longer endless.
   
   ## New configuration
   
   Per-consensus-group, pushed from ConfigNode to DataNode via `TRatisConfig`:
   
   | property | default |
   | --- | --- |
   | `config_node_ratis_reconfiguration_max_retry_attempts` | 600 |
   | `schema_region_ratis_reconfiguration_max_retry_attempts` | 600 |
   | `data_region_ratis_reconfiguration_max_retry_attempts` | 600 |
   
   Reconfiguration retries use a fixed 2s interval, so the bound roughly caps 
the wait at `attempts x 2s` (600 ~= 20 min). Tune lower for faster failover, 
higher for very large regions.
   
   ## Compatibility (rolling upgrade)
   
   The two new `TRatisConfig` fields are `optional`. A ConfigNode running the 
old version will not set them, in which case the DataNode keeps its **local 
default** instead of overwriting it with `0` (guarded by `isSet...()` in 
`IoTDBDescriptor`).
   
   ## Verification
   
   - `mvn compile` passes for `iotdb-core/consensus`, `iotdb-core/confignode`, 
`iotdb-core/datanode` (thrift regenerated).
   - `spotless:check` is clean.
   
   ## Note for reviewers
   
   - The default `600` (~20 min) favors not regressing legitimate slow 
`addPeer` (large snapshot transfer) over fast failover. Happy to adjust - 
schema-region snapshots are small, so a smaller `schema_region_*` default (e.g. 
30-60) may be preferable.
   - An integration test reproducing the killed-ADDING-peer scenario can be 
added as a follow-up if desired.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to