Ashish0253 commented on issue #28116:
URL: https://github.com/apache/airflow/issues/28116#issuecomment-3734479444

   I would like to work on this issue. I have successfully reproduced the 
'Zombie'/Heartbeat timeout scenario in a local Airflow 3.0 Breeze environment.
   
   My findings: when the SchedulerJobRunner detects a stale heartbeat (around 
line 2732), it logs the error to the console but does not persist the reason 
to the database. As a result, the UI logs for the Task Instance stay empty or 
uninformative, because the failure reason is never captured in the metadata.
   
   **Proposed Approach:**
   
   Model Update: Add a state_reason (or similar) string column to the 
TaskInstance model to store system-detected failure reasons.
   
   Scheduler Update: Modify the zombie/heartbeat reaper logic in 
scheduler_job_runner.py to populate this field when marking a TI as failed.
   
   API/UI: Ensure the Internal API includes this field so the UI can display a 
'System Error' banner or note to the user.
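
   To make the proposal concrete, here is a minimal, self-contained sketch of 
the idea. It does not use Airflow's real models or scheduler code: 
`TaskInstanceRow`, `last_heartbeat_at`, `fail_stale_task_instances`, and the 
wording of the reason string are all hypothetical stand-ins, and `state_reason` 
is only the column name suggested above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TaskInstanceRow:
    """Hypothetical stand-in for a TaskInstance row, not Airflow's model."""

    task_id: str
    state: str
    last_heartbeat_at: datetime
    # Proposed new column: a system-detected failure reason, if any.
    state_reason: Optional[str] = None


def fail_stale_task_instances(
    rows: List[TaskInstanceRow], now: datetime, timeout: timedelta
) -> List[TaskInstanceRow]:
    """Mark running rows with a stale heartbeat as failed and record why."""
    failed = []
    for ti in rows:
        if ti.state != "running":
            continue
        age = now - ti.last_heartbeat_at
        if age > timeout:
            ti.state = "failed"
            # Persist the reason on the row instead of only logging it.
            ti.state_reason = (
                f"Heartbeat timeout: no heartbeat for "
                f"{int(age.total_seconds())}s "
                f"(limit {int(timeout.total_seconds())}s)"
            )
            failed.append(ti)
    return failed
```

   In the real change, the assignment to `state_reason` would sit wherever the 
scheduler marks the TI as failed, so the API/UI could read the reason back 
from the database.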
   
   If I am approaching this from the wrong angle or if there's an existing 
mechanism for storing system-level failure reasons I should be using instead, 
please point me in the right direction!
   
   Could you please assign this to me?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
