DiogoP98 opened a new issue, #8419:
URL: https://github.com/apache/storm/issues/8419

   ## Description
   
   After a Nimbus deployment or restart, healthy topologies with the correct 
number of running workers are being rescheduled. This manifests as workers 
being killed and restarted unnecessarily, causing intermittent processing 
disruption until the cluster stabilizes.
   
   ### To Reproduce
   1. Have one or more topologies running with low/idle tuple throughput.
   2. Deploy Nimbus and trigger a leadership change.
   3. Observe repeated "Executor X not alive" log entries in Nimbus for 
executors that are actually running.
   4. Observe topologies being rescheduled despite having the correct number of 
workers.
   
   ### Expected behavior
   
   After Nimbus restarts and reconnects to the cluster, healthy topologies 
should remain stable with no rescheduling.
   
   ### Actual behavior
   
   Nimbus marks alive executors as timed out, causing their workers to be 
excluded from the assignment count. Since 
`numAssignedWorkers < numDesiredWorkers`, Nimbus continuously triggers 
rescheduling for those topologies.
   
   ### Identified Root cause
   
   In `HeartbeatCache.updateFromHb()`, the internal liveness timestamp 
(`nimbusTimeSecs`) is only refreshed when the executor's heartbeat stats 
timestamp (`TIME_SECS`) changes between consecutive heartbeats. For RPC-based 
heartbeats, `TIME_SECS` is the worker's wall-clock send time, so if two 
heartbeats are processed within the same second, the value is identical and 
`nimbusTimeSecs` is not updated. After `nimbus.task.timeout.secs` (default 30s) 
without a `nimbusTimeSecs` refresh, the executor is considered dead even though 
it is actively heartbeating.
   
   This is most visible after a Nimbus restart when the cache is empty and all 
executors need to re-establish their liveness within the timeout window.
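
To make the failure mode concrete, here is a minimal, self-contained model of the behavior described above. It is a sketch, not the Storm source: the class name `StaleHeartbeatSketch`, the explicit `nowSecs` parameter (standing in for Nimbus' clock), and the field layout are assumptions made for illustration.

```java
// Simplified model of the liveness check described in this issue.
// NOT the actual HeartbeatCache code; names and structure are illustrative.
final class StaleHeartbeatSketch {
    private Integer reportedTimeSecs; // last TIME_SECS value seen from the executor
    private int nimbusTimeSecs;       // Nimbus-side liveness timestamp
    private boolean timedOut;

    StaleHeartbeatSketch(int nowSecs) {
        this.nimbusTimeSecs = nowSecs;
    }

    // nowSecs is passed in (assumption) so the example is deterministic.
    void updateFromHb(int nowSecs, int timeoutSecs, int newReportedTimeSecs) {
        // Liveness is refreshed only when the reported timestamp changes.
        if (reportedTimeSecs == null || reportedTimeSecs != newReportedTimeSecs) {
            nimbusTimeSecs = nowSecs;
        }
        reportedTimeSecs = newReportedTimeSecs;
        timedOut = (nowSecs - nimbusTimeSecs) >= timeoutSecs;
    }

    boolean isTimedOut() {
        return timedOut;
    }
}
```

In this model, an executor that keeps heartbeating but whose reported timestamp does not change is flagged as timed out once the timeout elapses, even though heartbeats are still arriving.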
   
   ### Proposed fix
   
   Update `nimbusTimeSecs` on every received heartbeat rather than only when 
`TIME_SECS` changes, so that liveness is correctly tied to heartbeat arrival 
rather than stats freshness.
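
A rough sketch of what the proposed behavior could look like, again as a simplified model rather than a patch against Storm. The class name `ArrivalBasedLivenessSketch`, the explicit `nowSecs` parameter, and the separate `updateTimeout` method are assumptions for illustration only.

```java
// Simplified model of the proposed behavior: liveness follows heartbeat arrival.
// NOT the actual HeartbeatCache code; names and structure are illustrative.
final class ArrivalBasedLivenessSketch {
    private int reportedTimeSecs; // still tracked, but no longer gates liveness
    private int nimbusTimeSecs;   // Nimbus-side liveness timestamp
    private boolean timedOut;

    ArrivalBasedLivenessSketch(int nowSecs) {
        this.nimbusTimeSecs = nowSecs;
    }

    // Called for every heartbeat received; nowSecs stands in for Nimbus' clock.
    void updateFromHb(int nowSecs, int timeoutSecs, int newReportedTimeSecs) {
        nimbusTimeSecs = nowSecs; // refresh unconditionally on arrival
        reportedTimeSecs = newReportedTimeSecs;
        updateTimeout(nowSecs, timeoutSecs);
    }

    // Called when no heartbeat arrived, to re-evaluate liveness.
    void updateTimeout(int nowSecs, int timeoutSecs) {
        timedOut = (nowSecs - nimbusTimeSecs) >= timeoutSecs;
    }

    boolean isTimedOut() {
        return timedOut;
    }
}
```

With this variant, an executor is marked as timed out only when heartbeats actually stop arriving for the full timeout window, regardless of whether the reported stats timestamp changed.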
   
   ### Environment
    
   - Apache Storm 2.8.0
   - Java 11

