mwkang opened a new issue, #8590:
URL: https://github.com/apache/storm/issues/8590

   ## Summary
   
   Currently `EvenScheduler` (and therefore `DefaultScheduler`) does not move 
workers onto a fully-idle supervisor that returns to service after maintenance. 
The cluster keeps running with `used slot = 0` on the returned supervisor until 
an operator manually rebalances or restarts every affected topology.
   
   This proposal introduces an **opt-in, binary-trigger** rebalance pass inside 
`EvenScheduler`. When at least one non-blacklisted supervisor has zero used 
slots, the scheduler relocates one worker per topology per round-robin 
iteration onto the idle slots, until the idle capacity reaches an even 
per-supervisor share. Already-balanced topologies are never touched.
   
   The feature is **disabled by default** so existing behavior is preserved.
   
   ---
   
   ## Problem statement
   
   Today's `Cluster#needsScheduling` returns true only when the topology has 
fewer assigned workers than desired or has unassigned executors:
   
   ```java
   public boolean needsScheduling(TopologyDetails topology) {
       int desiredNumWorkers = topology.getNumWorkers();
       int assignedNumWorkers = this.getAssignedNumWorkers(topology);
       return desiredNumWorkers > assignedNumWorkers
           || getUnassignedExecutors(topology).size() > 0;
   }
   ```
   
   When a supervisor goes down for maintenance and comes back, the topology's 
workers were already redistributed onto the surviving supervisors, so both 
conditions are false and the scheduler considers the cluster "fully scheduled". 
The returning supervisor sits idle indefinitely.
   
   ```
   Initial state                  After supervisor maintenance
   sup-A    sup-B    sup-C        sup-A    sup-B    sup-C
   +-+-+-+  +-+-+-+  +-+-+-+      +-+-+-+  +-+-+-+
   |W|W| |  |W|W| |  |W|W| |      |W|W|W|  |W|W|W|       (sup-C is down)
   +-+-+-+  +-+-+-+  +-+-+-+      +-+-+-+  +-+-+-+
   
   After sup-C returns                  Today's behavior
   sup-A    sup-B    sup-C              sup-A    sup-B    sup-C
   +-+-+-+  +-+-+-+  +-+-+-+            +-+-+-+  +-+-+-+  +-+-+-+
   |W|W|W|  |W|W|W|  | | | |    -->     |W|W|W|  |W|W|W|  | | | |
   +-+-+-+  +-+-+-+  +-+-+-+            +-+-+-+  +-+-+-+  +-+-+-+
                                                          (stays idle forever)
   ```
   
   Operators must currently restart each affected topology by hand to drain the 
over-loaded supervisors.
   
   ---
   
   ## Proposed approach
   
   ### 1. Binary trigger
   
   `Cluster#needsScheduling` gets one extra branch: also return true when at 
least one non-blacklisted supervisor has **zero used slots** and the topology 
is not already on that supervisor. The check is binary — a supervisor either 
has zero used slots or it does not — so a near-balanced cluster never triggers 
this path.
   
   ```
   Triggers       sup-A    sup-B    sup-C
                  +-+-+-+  +-+-+-+  +-+-+-+
                  |W|W|W|W|W|W|W|W| | | | |   <-- triggers
                  +-+-+-+  +-+-+-+  +-+-+-+
   
   Does not       sup-A    sup-B    sup-C
   trigger        +-+-+-+  +-+-+-+  +-+-+-+
                  |W|W|W|W|W|W|W| | |W| | |   <-- (4, 3, 1) is not balanced
                  +-+-+-+  +-+-+-+  +-+-+-+       but sup-C has used > 0
                                                   so trigger does not fire
   ```
   
   ### 2. Round-robin relocation across topologies
   
   A new phase at the start of `scheduleTopologiesEvenly` walks idle slots 
round-robin across all triggering topologies. Each iteration moves at most one 
worker per topology, so the returning supervisor ends up hosting workers from 
many topologies — preserving the per-supervisor workload diversity that a fresh 
submission produces.
   
   ```
   Topologies A, B, C, D each at distribution (4, 4, 0).
   sup-C just returned with 4 free slots.
   
   Round-robin iterations:
     iter 1: A's worker -> sup-C:0
     iter 2: B's worker -> sup-C:1
     iter 3: C's worker -> sup-C:2
     iter 4: D's worker -> sup-C:3
     done (idle capacity exhausted).
   
   Result on sup-C: { worker of A, worker of B, worker of C, worker of D }
   ```
   
   If only one topology triggers, that topology fills the idle supervisor up to 
its even share and stops. If only some topologies are eligible, the remaining 
idle slots stay empty for this round.
   
   ### 3. Per-topology budget per round
   
   Each topology contributes at most
   `idleSupervisorCount * floor(numWorkers / nonBlacklistedSupervisorCount)`
   workers in one scheduling round, capped further by the idle side's free slot 
capacity. This means:
   
   - topologies with `numWorkers < numSupervisors` automatically skip the 
round-robin (their floor is zero) — single-worker topologies cannot ping-pong;
   - "near balanced" cluster never triggers in the first place;
   - one round produces an even per-supervisor share and the next round's 
trigger is silent.
   
   ### 4. Direct placement onto idle slots
   
   The relocated executors are assigned directly onto idle slots via 
`cluster.freeSlot(victim)` + `cluster.assign(idleSlot, ...)`. This bypasses the 
regular `sortSlots`/`interleaveAll` pass that would otherwise be free to drop 
some of the freed executors back into the just-vacated slots on the busy 
supervisors.
   
   ---
   
   ## Configuration keys
   
   Two new entries in `DaemonConfig`:
   
   | Key | Type | Default | Meaning |
   |---|---|---|---|
   | `nimbus.even.rebalance.idle-supervisor.enabled` | boolean | `false` | 
Master switch. When false, the new code paths are short-circuited and behavior 
is identical to today. |
   | `nimbus.even.rebalance.max-free-per-topology` | int | `0` | Optional upper 
bound on the number of workers a single topology may release per round. `0` or 
negative means unbounded (the per-round budget above takes effect). |
   
   ---
   
   ## Safety guards (already in the draft)
   
   - **Opt-in (default off)**: zero behavior change for existing deployments.
   - **Binary trigger**: an "almost balanced" cluster cannot trigger a 
relocation; no time-based cooldown is needed because the trigger condition 
itself rules out instability.
   - **Drain-to-zero protection**: a supervisor that holds only one worker of a 
topology is never the donor — the same round can never produce a new idle 
supervisor.
   - **Round-robin across topologies**: prevents the first-scheduled topology 
from monopolizing the idle capacity, which would concentrate one workload on 
the returning supervisor.
   - **Idle-side slot cap**: the round-robin pass never tries to free more 
workers than the idle capacity can absorb.
   - **Direct placement**: relocated executors never bounce back onto the 
just-vacated busy supervisors.
   
   ---
   
   ## Implementation outline
   
   The draft (against `master` / `3.0.0-SNAPSHOT`) touches three production 
files plus a test:
   
   ```
   storm-server/src/main/java/org/apache/storm/DaemonConfig.java
      - Two new keys (above).
   
   storm-server/src/main/java/org/apache/storm/scheduler/Cluster.java
      - needsScheduling: extra branch via hasIdleSupervisorReusableBy().
      - hasIdleSupervisorReusableBy(): binary check, returns false when 
disabled.
   
   storm-server/src/main/java/org/apache/storm/scheduler/EvenScheduler.java
      - scheduleTopologiesEvenly(): calls redistributeOntoIdleSupervisors() 
first.
      - redistributeOntoIdleSupervisors(): round-robin across topologies.
      - relocateOneWorkerOntoIdleSlot(): pulls one worker from the most-loaded
        supervisor of a topology, places it directly onto an idle slot.
   
   conf/defaults.yaml
      - Defaults for the two new keys.
   
   storm-server/src/test/java/.../TestEvenSchedulerIdleSupervisor.java
      - 9 unit tests covering trigger, drain cap, single-worker no-op,
        drain-to-zero protection, one-round even distribution, and
        round-robin sharing across multiple topologies.
   ```
   
   I will open the PR once this proposal direction is agreed upon. Happy to 
file a `STORM-XXXX` JIRA ticket and rebase under that key.
   
   ---
   
   ## Backward compatibility
   
   - Default off, so out-of-the-box behavior of `DefaultScheduler` / 
`EvenScheduler` is byte-for-byte identical.
   - No public API removed or renamed. Two new public methods 
(`Cluster#hasIdleSupervisorReusableBy`, 
`EvenScheduler#redistributeOntoIdleSupervisors`) are additions only.
   - New config keys both have safe defaults; no new YAML is required for 
existing clusters.
   
   Feedback welcome on direction, naming, and scope before I open the PR. 
Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to