mwkang commented on issue #8590: URL: https://github.com/apache/storm/issues/8590#issuecomment-4436047615
Thanks @rzo1 — really appreciate the careful read. The eight points are all fair and several of them catch things I would have shipped as-is. Quick inline replies below; I'll fold the doc / PR-description items into the PR write-up and the code items into the patch itself. ### 1. RAS scope Intentionally out of scope for this change. `ResourceAwareScheduler` calls `Cluster.needsSchedulingRas` (a separate method), and the new opt-in branch lives in `Cluster.needsScheduling`, so RAS keeps its existing trigger and never enters `EvenScheduler.scheduleTopologiesEvenly`. A parallel mechanism for RAS would belong in a follow-up — the placement engine is different enough that it warrants its own design. I'll state this explicitly in the PR description so RAS operators aren't left guessing. ### 2. "Relocation = worker JVM restart" note Agreed, this is missing. I'll add a **"When you would NOT want to enable this"** section to the docs covering: - topologies with windowed / stateful bolts that pay non-trivial replay cost, - topologies sensitive to JIT warmup, - clusters whose supervisors flap (see (5)), - RAS users (no effect, see (1)). The proposal sells the upside; the docs should sell the downside just as clearly. ### 3. Isolation / Multitenant interaction Already safe, but I agree it should be explicit and tested. - **`IsolationScheduler`**: blacklists isolated hosts before delegating leftover topologies to `DefaultScheduler.defaultSchedule`, which calls `EvenScheduler.scheduleTopologiesEvenly`. Both `hasIdleSupervisorReusableBy` and `redistributeOntoIdleSupervisors` skip blacklisted supervisors, so an isolated host can never be a donor *or* a target. The "isolated topology is down → host looks idle" case is covered by the same blacklist check. - **`MultitenantScheduler`** doesn't route through `EvenScheduler.scheduleTopologiesEvenly` at all (it uses its own `DefaultPool` / `IsolatedPool`), so the new branch is unreachable from it. I'll add two explicit tests: - IsolationScheduler + idle non-isolated supervisor → relocation only outside the isolated set. - IsolationScheduler with the isolated topology down → the isolation-reserved host is not treated as idle. ### 4. "Most-loaded supervisor" — measure and tie-break Today: - **Measure**: worker count per supervisor for that topology (not executor count). Workers are the unit being relocated, so worker count is the natural metric. - **Tie-break**: currently relies on `HashMap` iteration order — which is not deterministic. That's a real operator-observability bug, good catch. Fix: deterministic secondary key (supervisor id, lexicographic) with a test pinning the choice. PR description will spell out the rule so operators can reason about which worker moved and why. ### 5. Flap guard Convincing. The binary trigger blocks *thrash*, but not the case where a supervisor returns, absorbs workers, and then disappears 30s later. I'll add a third opt-in knob: ``` nimbus.even.rebalance.idle.supervisor.min.stable.rounds (default: 3) ``` A supervisor that has been present-and-idle for fewer than N consecutive scheduling rounds will not be considered eligible. Gated independently of the master switch so operators can tune it without re-evaluating the whole feature. Implementation note: `SupervisorInfo.uptime_secs` is already available on the thrift side, so the cleanest path is to surface it on `SupervisorDetails` and gate `hasIdleSupervisorReusableBy` on `uptime >= minStableRounds * supervisor.monitor.frequency.secs`. No new state machine needed. ### 6. `needsScheduling` call-path audit Verified across the tree: | Caller | Method used | Affected? | |---|---|---| | `DefaultScheduler.defaultSchedule` | `needsSchedulingTopologies` → `needsScheduling` | Yes, but gated by the (default-off) flag | | `EvenScheduler.scheduleTopologies` | `needsSchedulingTopologies` → `needsScheduling` | Yes, same gating | | `ResourceAwareScheduler` | `needsSchedulingRas` | **No** — separate method | | `IsolationScheduler` | Indirect, via leftover topologies through `DefaultScheduler.defaultSchedule` | Reached but neutralized — blacklist masks the isolated set (see (3)) | | `MultitenantScheduler` | Doesn't go through `needsScheduling` | **No** | I'll include this table in the PR description. ### 7. Config key naming Will fix. Renaming to dot-only: - `nimbus.even.rebalance.idle-supervisor.enabled` → `nimbus.even.rebalance.idle.supervisor.enabled` - `nimbus.even.rebalance.max-free-per-topology` → `nimbus.even.rebalance.max.free.per.topology` - `nimbus.even.rebalance.idle.supervisor.min.stable.rounds` *(new, from (5))* For reference, `conf/defaults.yaml` has ~280 dotted-only keys versus a single dash-containing one (`storm.auth.simple-white-list.users`), so dot-only is clearly the house style. Easier to fix now, before docs and release notes lock it in. ### 8. Blast radius Your reading is correct. In one scheduling pass the simultaneous restart count is ``` min(idle_slots, eligible_topologies) ``` with the per-topology contribution capped by ``` floor(numWorkers / nonBlacklistedSupervisorCount) * idleSupervisorCount ``` (further tightened by `max.free.per.topology` when set to a positive value). I'll lift your worked example into the docs verbatim: "one returning supervisor with 8 slots in a 50-topology cluster → 8 simultaneous worker restarts across 8 topologies in one pass." That's the number operators need to see before opting in. Happy to add a cluster-wide ceiling on top of the per-topology one — something like `nimbus.even.rebalance.max.relocations.per.round` — if reviewers want a hard cap independent of topology shape. Let me know your preference; I'd lean toward adding it (cheap, conservative-by-default) but don't want to bolt on knobs that aren't pulling weight. --- Plan from here: I'll fold (2), (4), (5), (7), (8), the audit table from (6), and the two new tests from (3) into the PR and open it. Points (1) and (3) need only doc/description framing — no code change. Will ping you on the PR once it's up. Thanks again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
