DiogoP98 opened a new pull request, #8420:
URL: https://github.com/apache/storm/pull/8420
## What is the purpose of the change
After a Nimbus restart, healthy topologies with the correct number of
running workers are repeatedly rescheduled. In the Nimbus logs, the following
message appears for executors that are actually running:
```
Executor <topo-id>:<executor> not alive
```
This causes Nimbus to consider those workers dead, which continuously triggers
rescheduling of the affected topologies.
### Root Cause
In `HeartbeatCache`, the internal liveness timestamp (`nimbusTimeSecs`) was
only refreshed when the heartbeat's `TIME_SECS` value changed between
consecutive calls.
For RPC-based heartbeats, `TIME_SECS` represents the worker's wall-clock
send time. If two heartbeats are processed within the same second, `TIME_SECS`
is identical, `nimbusTimeSecs` is not refreshed, and after
`nimbus.task.timeout.secs` (default 30s) the executor is falsely considered
dead — even though it is actively heartbeating.
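A simplified sketch of the pre-fix behaviour (class and field names here are illustrative, not the actual `HeartbeatCache` source):

```java
public class HeartbeatCacheSketch {
    static class ExecutorCacheEntry {
        Integer lastReportedTimeSecs; // the heartbeat's TIME_SECS
        int nimbusTimeSecs;           // liveness timestamp tracked by Nimbus

        // Pre-fix logic: nimbusTimeSecs is only refreshed when TIME_SECS
        // changes between consecutive heartbeats.
        void update(int reportedTimeSecs, int nowSecs) {
            if (lastReportedTimeSecs == null
                    || reportedTimeSecs != lastReportedTimeSecs) {
                nimbusTimeSecs = nowSecs;
            }
            lastReportedTimeSecs = reportedTimeSecs;
        }

        boolean isTimedOut(int nowSecs, int taskTimeoutSecs) {
            return nowSecs - nimbusTimeSecs > taskTimeoutSecs;
        }
    }

    public static void main(String[] args) {
        ExecutorCacheEntry entry = new ExecutorCacheEntry();
        // Two RPC heartbeats land in the same wall-clock second, so
        // TIME_SECS is identical and nimbusTimeSecs is never refreshed.
        entry.update(100, 0);
        for (int now = 1; now <= 40; now++) {
            entry.update(100, now); // worker still heartbeating every second
        }
        // After nimbus.task.timeout.secs (default 30s) the executor is
        // falsely reported dead even though it kept heartbeating.
        System.out.println(entry.isTimedOut(40, 30)); // prints "true"
    }
}
```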
### Fix
The heartbeat update logic is split into two separate methods to fix the RPC
path while preserving backwards compatibility with legacy ZK-based topologies:
- `updateFromRpcHb`: always refreshes `nimbusTimeSecs` on every heartbeat,
so idle-but-alive executors are never falsely timed out.
- `updateFromZkHb`: retains the original behaviour, only refreshing
`nimbusTimeSecs` when `TIME_SECS` advances. This preserves zombie detection
for legacy topologies, where `TIME_SECS` is stats-based and genuinely stops
advancing when an executor is stuck.
https://github.com/apache/storm/issues/8419
## How was the change tested
Compiled the project and tested the new version of the code on Nimbus
machines, forcing Nimbus to download a non-existent key.