CRZbulabula commented on PR #17908: URL: https://github.com/apache/iotdb/pull/17908#issuecomment-4678900296
Thanks @Caideyipi, both points are valid — fixed in 6dd498e. **1. Re-execution loop on shutdown.** You're right that `setNextState(DO_ADD_REGION_PEER)` + `HAS_MORE_STATE` triggers the in-place `reExecute` path (`subprocs[0] == proc`), and that `store.isRunning()` is not a sufficient exit condition: `stopExecutor()` runs `executor.stop()` and `executor.join()` *before* `store.stop()`, so during shutdown the store is still running while `join()` waits for this very worker — the inner loop would keep re-submitting/re-polling and `join()` would hang. (I also checked throwing `InterruptedException` to yield instead: that doesn't work here, because after the `catch (InterruptedException)` in `executeProcedure` the procedure falls through to `setState(SUCCESS)`, which is exactly the false-success we're trying to avoid.) Fix at the framework layer: the inner loop now also checks the executor's own running flag — `if (!isRunning() || !store.isRunning()) return;` — so the worker exits cleanly when the executor is stopping, and the persisted state lets the next leader resume. Hardening at the procedure layer: the AddRegionPeerTask is now submitted only when `getCycles() == 0` (first entry) in addition to the existing `isStateDeserialized()` guard, so an in-place re-entry never re-submits. **2. The test could miss the PROCESSING branch.** Correct, and worse than theoretical — I checked the logs of the previous (green) CI run and there is **no** `returns PROCESSING` line, i.e. the kill point fired after submit but before `waitTaskFinish()`, AddPeer finished quickly, and the test passed via the SUCCESS branch without ever exercising the new code. Fix: I added a `RegionMaintainKillPoints.WAIT_TASK_FINISH_POLLING` kill point that fires from *inside* `waitTaskFinish()`, once the first poll confirms the task is still `PROCESSING` — i.e. when the worker is provably blocked in the wait. The graceful stop then deterministically interrupts `waitTaskFinish()`. The test registers this kill point instead of `DO_ADD_REGION_PEER`, and additionally asserts that a ConfigNode log contains `returns PROCESSING`, so it now fails loudly if the branch is skipped. I'll re-run the tests on the per-PR Cluster IT pipeline to confirm the PROCESSING branch is now actually hit before finalizing. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
