mwkang commented on issue #8590:
URL: https://github.com/apache/storm/issues/8590#issuecomment-4436047615

   Thanks @rzo1 — really appreciate the careful read. The eight points are all 
fair and several of them catch things I would have shipped as-is. Quick inline 
replies below; I'll fold the doc / PR-description items into the PR write-up 
and the code items into the patch itself.
   
   ### 1. RAS scope
   
   Intentionally out of scope for this change. `ResourceAwareScheduler` calls 
`Cluster.needsSchedulingRas` (a separate method), and the new opt-in branch 
lives in `Cluster.needsScheduling`, so RAS keeps its existing trigger and never 
enters `EvenScheduler.scheduleTopologiesEvenly`. A parallel mechanism for RAS 
would belong in a follow-up — the placement engine is different enough that it 
warrants its own design. I'll state this explicitly in the PR description so 
RAS operators aren't left guessing.
   
   ### 2. "Relocation = worker JVM restart" note
   
   Agreed, this is missing. I'll add a **"When you would NOT want to enable 
this"** section to the docs covering:
   
   - topologies with windowed / stateful bolts that pay non-trivial replay cost,
   - topologies sensitive to JIT warmup,
   - clusters whose supervisors flap (see (5)),
   - RAS users (no effect, see (1)).
   
   The proposal sells the upside; the docs should sell the downside just as 
clearly.
   
   ### 3. Isolation / Multitenant interaction
   
   Already safe, but I agree it should be explicit and tested.
   
   - **`IsolationScheduler`**: blacklists isolated hosts before delegating 
leftover topologies to `DefaultScheduler.defaultSchedule`, which calls 
`EvenScheduler.scheduleTopologiesEvenly`. Both `hasIdleSupervisorReusableBy` 
and `redistributeOntoIdleSupervisors` skip blacklisted supervisors, so an 
isolated host can never be a donor *or* a target. The "isolated topology is 
down → host looks idle" case is covered by the same blacklist check.
   - **`MultitenantScheduler`** doesn't route through 
`EvenScheduler.scheduleTopologiesEvenly` at all (it uses its own `DefaultPool` 
/ `IsolatedPool`), so the new branch is unreachable from it.
   
   I'll add two explicit tests:
   - IsolationScheduler + idle non-isolated supervisor → relocation only 
outside the isolated set.
   - IsolationScheduler with the isolated topology down → the 
isolation-reserved host is not treated as idle.
   
   ### 4. "Most-loaded supervisor" — measure and tie-break
   
   Today:
   - **Measure**: worker count per supervisor for that topology (not executor 
count). Workers are the unit being relocated, so worker count is the natural 
metric.
   - **Tie-break**: currently relies on `HashMap` iteration order — which is 
not deterministic. That's a real operator-observability bug, good catch.
   
   Fix: deterministic secondary key (supervisor id, lexicographic) with a test 
pinning the choice. PR description will spell out the rule so operators can 
reason about which worker moved and why.
   
   ### 5. Flap guard
   
   Convincing. The binary trigger blocks *thrash*, but not the case where a 
supervisor returns, absorbs workers, and then disappears 30s later. I'll add a 
third opt-in knob:
   
   ```
   nimbus.even.rebalance.idle.supervisor.min.stable.rounds  (default: 3)
   ```
   
   A supervisor that has been present-and-idle for fewer than N consecutive 
scheduling rounds will not be considered eligible. Gated independently of the 
master switch so operators can tune it without re-evaluating the whole feature.
   
   Implementation note: `SupervisorInfo.uptime_secs` is already available on 
the thrift side, so the cleanest path is to surface it on `SupervisorDetails` 
and gate `hasIdleSupervisorReusableBy` on `uptime >= minStableRounds * 
supervisor.monitor.frequency.secs`. No new state machine needed.
   
   ### 6. `needsScheduling` call-path audit
   
   Verified across the tree:
   
   | Caller | Method used | Affected? |
   |---|---|---|
   | `DefaultScheduler.defaultSchedule` | `needsSchedulingTopologies` → 
`needsScheduling` | Yes, but gated by the (default-off) flag |
   | `EvenScheduler.scheduleTopologies` | `needsSchedulingTopologies` → 
`needsScheduling` | Yes, same gating |
   | `ResourceAwareScheduler` | `needsSchedulingRas` | **No** — separate method 
|
   | `IsolationScheduler` | Indirect, via leftover topologies through 
`DefaultScheduler.defaultSchedule` | Reached but neutralized — blacklist masks 
the isolated set (see (3)) |
   | `MultitenantScheduler` | Doesn't go through `needsScheduling` | **No** |
   
   I'll include this table in the PR description.
   
   ### 7. Config key naming
   
   Will fix. Renaming to dot-only:
   
   - `nimbus.even.rebalance.idle-supervisor.enabled` → 
`nimbus.even.rebalance.idle.supervisor.enabled`
   - `nimbus.even.rebalance.max-free-per-topology` → 
`nimbus.even.rebalance.max.free.per.topology`
   - `nimbus.even.rebalance.idle.supervisor.min.stable.rounds` *(new, from (5))*
   
   For reference, `conf/defaults.yaml` has ~280 dotted-only keys versus a 
single dash-containing one (`storm.auth.simple-white-list.users`), so dot-only 
is clearly the house style. Easier to fix now, before docs and release notes 
lock it in.
   
   ### 8. Blast radius
   
   Your reading is correct. In one scheduling pass the simultaneous restart 
count is
   
   ```
   min(idle_slots, eligible_topologies)
   ```
   
   with the per-topology contribution capped by
   
   ```
   floor(numWorkers / nonBlacklistedSupervisorCount) * idleSupervisorCount
   ```
   
   (further tightened by `max.free.per.topology` when set to a positive value).
   
   I'll lift your worked example into the docs verbatim: "one returning 
supervisor with 8 slots in a 50-topology cluster → 8 simultaneous worker 
restarts across 8 topologies in one pass." That's the number operators need to 
see before opting in.
   
   Happy to add a cluster-wide ceiling on top of the per-topology one — 
something like `nimbus.even.rebalance.max.relocations.per.round` — if reviewers 
want a hard cap independent of topology shape. Let me know your preference; I'd 
lean toward adding it (cheap, conservative-by-default) but don't want to bolt 
on knobs that aren't pulling weight.
   
   ---
   
   Plan from here: I'll fold (2), (4), (5), (7), (8), the audit table from (6), 
and the two new tests from (3) into the PR and open it. Points (1) and (3) need 
only doc/description framing — no code change. Will ping you on the PR once 
it's up. Thanks again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to