CRZbulabula commented on PR #17898:
URL: https://github.com/apache/iotdb/pull/17898#issuecomment-4668475325

   @Caideyipi Thanks, this is a fair concern and I want to be explicit about 
the trade-off I'm making.
   
   You're right that this turns every parallel startup into best-effort and 
still marks `leaderServicesReady` true, so in the worst case `confirmLeader()` 
can return SUCCESS while one leader-only service (pipe/CQ/subscription) stays 
disabled until the next leadership transition. I considered the three options 
you listed (required-vs-optional gating, keep-warming on required failure, 
retry/health gating) but I think for this PR best-effort is actually the right 
line, for a few reasons:
   
   1. **These startups almost never fail.** They are the same activation calls 
that already ran successfully on every previous leadership transition; a 
`RuntimeException` here is a genuinely exceptional event (a bug or hard 
environmental fault), not an expected transient condition. So we're optimizing 
the handling of a near-zero-probability path.
   
   2. **There is no good in-place recovery handle.** If `startCQScheduler()` / 
`startPipeMetaSync()` / `startSubscriptionMetaSync()` throws, none of them 
expose an idempotent "try again" entry point that's safe to call mid-epoch — 
re-running them isn't a free retry, and a retry/health-gating loop would have 
to reason about partial state each one left behind. Building that recovery 
machinery is a much larger change than the bug this PR fixes, and it would add 
real complexity to the warm-up path for a case that essentially never fires.
   
   3. **`leaderServicesReady` blocking forever is strictly worse than 
best-effort serving.** The pre-fix behavior is: one throw aborts 
`becomeLeader()` *before* `markLeaderServicesReadyIfEpochCurrent()`, so the 
node returns `CONFIG_NODE_LEADER_WARMING_UP` **indefinitely** — the entire 
ConfigNode is unserviceable, including all the leader functionality that 
started fine, with no log pointing at the cause. The 
keep-warming-on-required-failure option you mention reproduces exactly this 
hang for required services; it just makes it intentional. Isolating the blast 
radius so one failed service can't take down the whole leader — and surfacing 
it via an ERROR log naming the service — is already a large improvement over a 
silent total stall.
   
   So the design intent of this PR is deliberately narrow: **make the 
documented "always completes normally" invariant true, and make a failure 
observable instead of silent**, rather than try to auto-heal a service that we 
have no clean way to restart anyway.
   
   If we do want required-vs-optional distinction or retry/health gating later, 
I'd rather do it as a focused follow-up that also gives each service a real 
idempotent restart path (so retry is meaningful) and a health endpoint to gate 
on — otherwise we're gating readiness on a flag we can't recover. Happy to file 
that as a separate issue. Does scoping it that way sound reasonable to you?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to