mwkang opened a new issue, #8590:
URL: https://github.com/apache/storm/issues/8590
## Summary
Currently `EvenScheduler` (and therefore `DefaultScheduler`) does not move
workers onto a fully-idle supervisor that returns to service after maintenance.
The cluster keeps running with `used slot = 0` on the returned supervisor until
an operator manually rebalances or restarts every affected topology.
This proposal introduces an **opt-in, binary-trigger** rebalance pass inside
`EvenScheduler`. When at least one non-blacklisted supervisor has zero used
slots, the scheduler relocates one worker per topology per round-robin
iteration onto the idle slots, until the idle capacity reaches an even
per-supervisor share. Already-balanced topologies are never touched.
The feature is **disabled by default** so existing behavior is preserved.
---
## Problem statement
Today's `Cluster#needsScheduling` returns true only when the topology has
fewer assigned workers than desired or has unassigned executors:
```java
public boolean needsScheduling(TopologyDetails topology) {
int desiredNumWorkers = topology.getNumWorkers();
int assignedNumWorkers = this.getAssignedNumWorkers(topology);
return desiredNumWorkers > assignedNumWorkers
|| getUnassignedExecutors(topology).size() > 0;
}
```
When a supervisor goes down for maintenance and comes back, the topology's
workers were already redistributed onto the surviving supervisors, so both
conditions are false and the scheduler considers the cluster "fully scheduled".
The returning supervisor sits idle indefinitely.
```
Initial state After supervisor maintenance
sup-A sup-B sup-C sup-A sup-B sup-C
+-+-+-+ +-+-+-+ +-+-+-+ +-+-+-+ +-+-+-+
|W|W| | |W|W| | |W|W| | |W|W|W| |W|W|W| (sup-C is down)
+-+-+-+ +-+-+-+ +-+-+-+ +-+-+-+ +-+-+-+
After sup-C returns Today's behavior
sup-A sup-B sup-C sup-A sup-B sup-C
+-+-+-+ +-+-+-+ +-+-+-+ +-+-+-+ +-+-+-+ +-+-+-+
|W|W|W| |W|W|W| | | | | --> |W|W|W| |W|W|W| | | | |
+-+-+-+ +-+-+-+ +-+-+-+ +-+-+-+ +-+-+-+ +-+-+-+
(stays idle forever)
```
Operators must currently restart each affected topology by hand to drain the
over-loaded supervisors.
---
## Proposed approach
### 1. Binary trigger
`Cluster#needsScheduling` gets one extra branch: also return true when at
least one non-blacklisted supervisor has **zero used slots** and the topology
is not already on that supervisor. The check is binary — a supervisor either
has zero used slots or it does not — so a near-balanced cluster never triggers
this path.
```
Triggers sup-A sup-B sup-C
+-+-+-+ +-+-+-+ +-+-+-+
|W|W|W|W|W|W|W|W| | | | | <-- triggers
+-+-+-+ +-+-+-+ +-+-+-+
Does not sup-A sup-B sup-C
trigger +-+-+-+ +-+-+-+ +-+-+-+
|W|W|W|W|W|W|W| | |W| | | <-- (4, 3, 1) is not balanced
+-+-+-+ +-+-+-+ +-+-+-+ but sup-C has used > 0
so trigger does not fire
```
### 2. Round-robin relocation across topologies
A new phase at the start of `scheduleTopologiesEvenly` walks idle slots
round-robin across all triggering topologies. Each iteration moves at most one
worker per topology, so the returning supervisor ends up hosting workers from
many topologies — preserving the per-supervisor workload diversity that a fresh
submission produces.
```
Topologies A, B, C, D each at distribution (4, 4, 0).
sup-C just returned with 4 free slots.
Round-robin iterations:
iter 1: A's worker -> sup-C:0
iter 2: B's worker -> sup-C:1
iter 3: C's worker -> sup-C:2
iter 4: D's worker -> sup-C:3
done (idle capacity exhausted).
Result on sup-C: { worker of A, worker of B, worker of C, worker of D }
```
If only one topology triggers, that topology fills the idle supervisor up to
its even share and stops. If only some topologies are eligible, the remaining
idle slots stay empty for this round.
### 3. Per-topology budget per round
Each topology contributes at most
`idleSupervisorCount * floor(numWorkers / nonBlacklistedSupervisorCount)`
workers in one scheduling round, capped further by the idle side's free slot
capacity. This means:
- topologies with `numWorkers < numSupervisors` automatically skip the
round-robin (their floor is zero) — single-worker topologies cannot ping-pong;
- "near balanced" cluster never triggers in the first place;
- one round produces an even per-supervisor share and the next round's
trigger is silent.
### 4. Direct placement onto idle slots
The relocated executors are assigned directly onto idle slots via
`cluster.freeSlot(victim)` + `cluster.assign(idleSlot, ...)`. This bypasses the
regular `sortSlots`/`interleaveAll` pass that would otherwise be free to drop
some of the freed executors back into the just-vacated slots on the busy
supervisors.
---
## Configuration keys
Two new entries in `DaemonConfig`:
| Key | Type | Default | Meaning |
|---|---|---|---|
| `nimbus.even.rebalance.idle-supervisor.enabled` | boolean | `false` |
Master switch. When false, the new code paths are short-circuited and behavior
is identical to today. |
| `nimbus.even.rebalance.max-free-per-topology` | int | `0` | Optional upper
bound on the number of workers a single topology may release per round. `0` or
negative means unbounded (the per-round budget above takes effect). |
---
## Safety guards (already in the draft)
- **Opt-in (default off)**: zero behavior change for existing deployments.
- **Binary trigger**: an "almost balanced" cluster cannot trigger a
relocation; no time-based cooldown is needed because the trigger condition
itself rules out instability.
- **Drain-to-zero protection**: a supervisor that holds only one worker of a
topology is never the donor — the same round can never produce a new idle
supervisor.
- **Round-robin across topologies**: prevents the first-scheduled topology
from monopolizing the idle capacity, which would concentrate one workload on
the returning supervisor.
- **Idle-side slot cap**: the round-robin pass never tries to free more
workers than the idle capacity can absorb.
- **Direct placement**: relocated executors never bounce back onto the
just-vacated busy supervisors.
---
## Implementation outline
The draft (against `master` / `3.0.0-SNAPSHOT`) touches three production
files plus a test:
```
storm-server/src/main/java/org/apache/storm/DaemonConfig.java
- Two new keys (above).
storm-server/src/main/java/org/apache/storm/scheduler/Cluster.java
- needsScheduling: extra branch via hasIdleSupervisorReusableBy().
- hasIdleSupervisorReusableBy(): binary check, returns false when
disabled.
storm-server/src/main/java/org/apache/storm/scheduler/EvenScheduler.java
- scheduleTopologiesEvenly(): calls redistributeOntoIdleSupervisors()
first.
- redistributeOntoIdleSupervisors(): round-robin across topologies.
- relocateOneWorkerOntoIdleSlot(): pulls one worker from the most-loaded
supervisor of a topology, places it directly onto an idle slot.
conf/defaults.yaml
- Defaults for the two new keys.
storm-server/src/test/java/.../TestEvenSchedulerIdleSupervisor.java
- 9 unit tests covering trigger, drain cap, single-worker no-op,
drain-to-zero protection, one-round even distribution, and
round-robin sharing across multiple topologies.
```
I will open the PR once this proposal direction is agreed upon. Happy to
file a `STORM-XXXX` JIRA ticket and rebase under that key.
---
## Backward compatibility
- Default off, so out-of-the-box behavior of `DefaultScheduler` /
`EvenScheduler` is byte-for-byte identical.
- No public API removed or renamed. Two new public methods
(`Cluster#hasIdleSupervisorReusableBy`,
`EvenScheduler#redistributeOntoIdleSupervisors`) are additions only.
- New config keys both have safe defaults; no new YAML is required for
existing clusters.
Feedback welcome on direction, naming, and scope before I open the PR.
Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]