CRZbulabula commented on PR #17908:
URL: https://github.com/apache/iotdb/pull/17908#issuecomment-4678900296

   Thanks @Caideyipi, both points are valid — fixed in 6dd498e.
   
   **1. Re-execution loop on shutdown.** You're right that 
`setNextState(DO_ADD_REGION_PEER)` + `HAS_MORE_STATE` triggers the in-place 
`reExecute` path (`subprocs[0] == proc`), and that `store.isRunning()` is not a 
sufficient exit condition: `stopExecutor()` runs `executor.stop()` and 
`executor.join()` *before* `store.stop()`, so during shutdown the store is 
still running while `join()` waits for this very worker — the inner loop would 
keep re-submitting/re-polling and `join()` would hang. (I also checked throwing 
`InterruptedException` to yield instead: that doesn't work here, because after 
the `catch (InterruptedException)` in `executeProcedure` the procedure falls 
through to `setState(SUCCESS)`, which is exactly the false-success we're trying 
to avoid.)
   
   Fix at the framework layer: the inner loop now also checks the executor's 
own running flag — `if (!isRunning() || !store.isRunning()) return;` — so the 
worker exits cleanly when the executor is stopping, and the persisted state 
lets the next leader resume. Hardening at the procedure layer: the 
AddRegionPeerTask is now submitted only when `getCycles() == 0` (first entry) 
in addition to the existing `isStateDeserialized()` guard, so an in-place 
re-entry never re-submits.
   
   **2. The test could miss the PROCESSING branch.** Correct, and worse than 
theoretical — I checked the logs of the previous (green) CI run and there is 
**no** `returns PROCESSING` line, i.e. the kill point fired after submit but 
before `waitTaskFinish()`, AddPeer finished quickly, and the test passed via 
the SUCCESS branch without ever exercising the new code.
   
   Fix: I added a `RegionMaintainKillPoints.WAIT_TASK_FINISH_POLLING` kill 
point that fires from *inside* `waitTaskFinish()`, once the first poll confirms 
the task is still `PROCESSING` — i.e. when the worker is provably blocked in 
the wait. The graceful stop then deterministically interrupts 
`waitTaskFinish()`. The test registers this kill point instead of 
`DO_ADD_REGION_PEER`, and additionally asserts that a ConfigNode log contains 
`returns PROCESSING`, so it now fails loudly if the branch is skipped.
   
   I'll re-run the tests on the per-PR Cluster IT pipeline to confirm the 
PROCESSING branch is now actually hit before finalizing.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to