DiogoP98 opened a new issue, #8419:
URL: https://github.com/apache/storm/issues/8419
## Description
After a Nimbus deployment or restart, healthy topologies with the correct
number of running workers are being rescheduled. This manifests as workers
being killed and restarted unnecessarily, causing intermittent processing
disruption until the cluster stabilizes.
### To Reproduce
1. Have one or more topologies running with low/idle tuple throughput.
2. Deploy (or restart) Nimbus so that a leadership change occurs.
3. Observe repeated "Executor X not alive" log entries in Nimbus for
executors that are actually running.
4. Observe topologies being rescheduled despite having the correct number of
workers.
### Expected behavior
After Nimbus restarts and reconnects to the cluster, healthy topologies
should remain stable with no rescheduling.
### Actual behavior
Nimbus marks live executors as timed out, so their workers are excluded
from the assignment count. Since `numAssignedWorkers < numDesiredWorkers`,
Nimbus continuously triggers rescheduling for those topologies.
### Identified root cause
In `HeartbeatCache.updateFromHb()`, the internal liveness timestamp
(`nimbusTimeSecs`) is only refreshed when the executor's heartbeat stats
timestamp (`TIME_SECS`) changes between consecutive heartbeats. For RPC-based
heartbeats, `TIME_SECS` is the worker's wall-clock send time. If two
heartbeats are processed within the same second, the value is identical and
`nimbusTimeSecs` is not updated. After `nimbus.task.timeout.secs` (default 30s)
without a `nimbusTimeSecs` refresh, the executor is considered dead even though
it is actively heartbeating.
This is most visible after a Nimbus restart when the cache is empty and all
executors need to re-establish their liveness within the timeout window.
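
To make the failure mode concrete, here is a minimal, simplified model of the behaviour described above. It is not the actual Storm source: the class `StaleLivenessSketch`, the `Entry` type, and the method signatures (explicit `nowSecs` argument, string executor ids) are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the liveness check described in this issue.
// NOT the real HeartbeatCache; all names here are hypothetical.
class StaleLivenessSketch {
    // Default nimbus.task.timeout.secs as stated in the issue.
    static final int TASK_TIMEOUT_SECS = 30;

    // Per-executor state: last stats timestamp seen and Nimbus-side liveness time.
    static final class Entry {
        int lastReportedTimeSecs;
        int nimbusTimeSecs;
    }

    private final Map<String, Entry> cache = new HashMap<>();

    // Refreshes nimbusTimeSecs only when the reported TIME_SECS changed,
    // which is the behaviour the issue identifies as the root cause.
    void updateFromHb(String executor, int reportedTimeSecs, int nowSecs) {
        Entry e = cache.get(executor);
        if (e == null) {
            // First heartbeat after the cache was (re)built.
            e = new Entry();
            e.lastReportedTimeSecs = reportedTimeSecs;
            e.nimbusTimeSecs = nowSecs;
            cache.put(executor, e);
            return;
        }
        if (e.lastReportedTimeSecs != reportedTimeSecs) {
            e.lastReportedTimeSecs = reportedTimeSecs;
            e.nimbusTimeSecs = nowSecs;
        }
        // Otherwise the heartbeat arrived, but liveness is NOT refreshed.
    }

    // An executor is considered dead once its liveness timestamp is too old.
    boolean isTimedOut(String executor, int nowSecs) {
        Entry e = cache.get(executor);
        return e == null || nowSecs - e.nimbusTimeSecs >= TASK_TIMEOUT_SECS;
    }
}
```

In this model, an executor that heartbeats at Nimbus times 0, 10, 20, and 30 while reporting an unchanged `TIME_SECS` is flagged as timed out at time 30, even though four heartbeats arrived within the window.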
### Proposed fix
Update `nimbusTimeSecs` on every received heartbeat rather than only when
`TIME_SECS` changes, so that liveness is correctly tied to heartbeat arrival
rather than stats freshness.
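
A rough sketch of what the proposed behaviour could look like, under the same simplifying assumptions as the description above. The class `ArrivalLivenessSketch`, its maps, and its method signatures are hypothetical and do not mirror Storm's real `HeartbeatCache` API; the only point is that the liveness timestamp is refreshed unconditionally.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed fix; names are illustrative, not Storm's API.
class ArrivalLivenessSketch {
    private final Map<String, Integer> lastReportedTimeSecs = new HashMap<>();
    private final Map<String, Integer> nimbusTimeSecs = new HashMap<>();

    void updateFromHb(String executor, int reportedTimeSecs, int nowSecs) {
        // Stats freshness is still recorded, but it no longer gates liveness.
        lastReportedTimeSecs.put(executor, reportedTimeSecs);
        // Liveness follows heartbeat arrival: refresh on every heartbeat.
        nimbusTimeSecs.put(executor, nowSecs);
    }

    // timeoutSecs would correspond to nimbus.task.timeout.secs.
    boolean isTimedOut(String executor, int nowSecs, int timeoutSecs) {
        Integer last = nimbusTimeSecs.get(executor);
        return last == null || nowSecs - last >= timeoutSecs;
    }
}
```

With this variant, an executor that keeps heartbeating with an unchanged `TIME_SECS` stays alive, and it is only timed out once heartbeats actually stop arriving for the full timeout.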
### Environment
- Apache Storm 2.8.0
- Java 11