Ashish0253 commented on issue #28116: URL: https://github.com/apache/airflow/issues/28116#issuecomment-3734479444
I would like to work on this issue. I have successfully reproduced the 'Zombie'/Heartbeat timeout scenario in a local Airflow 3.0 Breeze environment.

My findings: I've identified that when the SchedulerJobRunner detects a stale heartbeat (around line 2732), it logs the error to the console but does not persist this reason to the database. Consequently, the UI logs for the Task Instance remain empty or uninformative because the failure reason isn't captured in the metadata.

**Proposed Approach:**

1. Model Update: Add a state_reason (or similar) string column to the TaskInstance model to store system-detected failure reasons.
2. Scheduler Update: Modify the zombie/heartbeat reaper logic in scheduler_job_runner.py to populate this field when marking a TI as failed.
3. API/UI: Ensure the Internal API includes this field so the UI can display a 'System Error' banner or note to the user.

If I am approaching this from the wrong angle or if there's an existing mechanism for storing system-level failure reasons I should be using instead, please point me in the right direction! Could you please assign this to me?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
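For illustration, the approach proposed in the comment above could be sketched roughly as follows. This is a standalone toy model and not Airflow code: `TaskInstanceSketch`, `fail_stale_task_instances`, the `last_heartbeat_at` field, and the wording of the reason string are all hypothetical names chosen for this example, and `state_reason` is only the column name suggested in the comment. The real change would live in the TaskInstance model and scheduler_job_runner.py and would need a database migration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class TaskInstanceSketch:
    """Hypothetical stand-in for a TaskInstance row; not Airflow's real model."""

    task_id: str
    state: str
    last_heartbeat_at: datetime
    # Proposed field from the comment; the final name is an assumption.
    state_reason: Optional[str] = None


def fail_stale_task_instances(
    task_instances: List[TaskInstanceSketch],
    now: datetime,
    heartbeat_timeout: timedelta,
) -> List[TaskInstanceSketch]:
    """Mark running task instances with a stale heartbeat as failed.

    Besides changing the state, store a human-readable reason on the
    instance so that an API or UI layer could display it later.
    """
    failed = []
    for ti in task_instances:
        if ti.state != "running":
            continue
        age = now - ti.last_heartbeat_at
        if age > heartbeat_timeout:
            ti.state = "failed"
            ti.state_reason = (
                f"Heartbeat timeout: no heartbeat for "
                f"{int(age.total_seconds())}s "
                f"(limit {int(heartbeat_timeout.total_seconds())}s)"
            )
            failed.append(ti)
    return failed


# Usage: one healthy task instance and one whose heartbeat went stale.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
tis = [
    TaskInstanceSketch("healthy", "running", now - timedelta(seconds=30)),
    TaskInstanceSketch("zombie", "running", now - timedelta(seconds=600)),
]
failed = fail_stale_task_instances(tis, now, timedelta(seconds=300))
for ti in failed:
    print(ti.task_id, ti.state, ti.state_reason)
```

The point of the sketch is only that the reason is written at the same moment the state changes, so it cannot be lost the way a console-only log line can.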
