DiogoP98 opened a new pull request, #8420:
URL: https://github.com/apache/storm/pull/8420
## What is the purpose of the change
After a Nimbus restart, healthy topologies with the correct number of
running workers are repeatedly rescheduled. In the Nimbus logs, the following
message appears for executors that are actually running:
```
Executor <topo-id>:<executor> not alive
```
This causes Nimbus to consider those workers dead, which continuously triggers
rescheduling of the affected topologies.
### Root Cause
In `HeartbeatCache`, the internal liveness timestamp (`nimbusTimeSecs`) was
only refreshed when the heartbeat's `TIME_SECS` value changed between
consecutive calls.
For RPC-based heartbeats, `TIME_SECS` represents the worker's wall-clock
send time. If two heartbeats are processed within the same second, `TIME_SECS`
is identical, `nimbusTimeSecs` is not refreshed, and after
`nimbus.task.timeout.secs` (default 30s) the executor is falsely considered
dead — even though it is actively heartbeating.
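A simplified sketch of the pre-fix behaviour (class and field names here are illustrative, not the actual `HeartbeatCache` source):

```java
public class HeartbeatCacheSketch {
    static class ExecutorCacheEntry {
        Integer lastReportedTimeSecs; // the heartbeat's TIME_SECS
        int nimbusTimeSecs;           // liveness timestamp tracked by Nimbus

        // Pre-fix logic: nimbusTimeSecs is only refreshed when TIME_SECS
        // changes between consecutive heartbeats.
        void update(int reportedTimeSecs, int nowSecs) {
            if (lastReportedTimeSecs == null
                    || reportedTimeSecs != lastReportedTimeSecs) {
                nimbusTimeSecs = nowSecs;
            }
            lastReportedTimeSecs = reportedTimeSecs;
        }

        boolean isTimedOut(int nowSecs, int taskTimeoutSecs) {
            return nowSecs - nimbusTimeSecs > taskTimeoutSecs;
        }
    }

    public static void main(String[] args) {
        ExecutorCacheEntry entry = new ExecutorCacheEntry();
        // Two RPC heartbeats land in the same wall-clock second, so
        // TIME_SECS is identical and nimbusTimeSecs is never refreshed.
        entry.update(100, 0);
        for (int now = 1; now <= 40; now++) {
            entry.update(100, now); // worker still heartbeating every second
        }
        // After nimbus.task.timeout.secs (default 30s) the executor is
        // falsely reported dead even though it kept heartbeating.
        System.out.println(entry.isTimedOut(40, 30)); // prints "true"
    }
}
```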
### Fix
The heartbeat update logic is split into two separate methods to fix the RPC
path while preserving backwards compatibility with legacy ZK-based topologies:
- `updateFromRpcHb`: always refreshes `nimbusTimeSecs` on every heartbeat,
so idle-but-alive executors are never falsely timed out.
- `updateFromZkHb`: retains the original behaviour, only refreshing
`nimbusTimeSecs` when `TIME_SECS` advances. This preserves zombie detection
for legacy topologies, where `TIME_SECS` is stats-based and genuinely stops
advancing when an executor is stuck.
https://github.com/apache/storm/issues/8419
## How was the change tested
Compiled the project and tested the new version of the code on Nimbus
machines, forcing Nimbus to download a non-existent key.