amyshields opened a new issue, #41884:
URL: https://github.com/apache/airflow/issues/41884

   ### Apache Airflow version
   
   2.9.3
   
   ### If "Other Airflow 2 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   We have seen the following sequence several times, with a delay before 
task states are updated:
   
   1. A task failed
   Up to 5 minutes go by (the longest wait we have seen so far)
   2. The task itself is marked as `FAILED`
   3. All downstream tasks are marked as `upstream_failed`
   
   Note that we see the same delay when a task succeeds: the new state is 
not reflected in the Airflow UI or its metadata DB for some time.
   
   We validated this by calling Airflow's API to retrieve the task instance: 
the state it returned had not yet been updated as we would expect.
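   
   A minimal sketch of such an API check, not the exact script we ran. It 
assumes the stable REST API is reachable at a `base_url` of your choosing and 
that basic auth is enabled for the API; the function names and arguments are 
illustrative only.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def task_instance_url(base_url, dag_id, dag_run_id, task_id):
    """Build the stable REST API URL for a single task instance."""
    # Run IDs usually contain ':' and '+', so every path segment is quoted.
    parts = [quote(p, safe="") for p in (dag_id, dag_run_id, task_id)]
    return "{}/api/v1/dags/{}/dagRuns/{}/taskInstances/{}".format(
        base_url.rstrip("/"), *parts
    )


def fetch_task_state(base_url, dag_id, dag_run_id, task_id, user, password):
    """Return (state, end_date) as reported by the API.

    Needs a live webserver; basic auth is an assumption about the deployment.
    """
    url = task_instance_url(base_url, dag_id, dag_run_id, task_id)
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.load(resp)
    return body["state"], body["end_date"]
```

   Calling `fetch_task_state` repeatedly right after a task finishes shows how 
long the API keeps returning the old state.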
   
   This exact case happened today (30th Aug) with a 2 minute delay:
   1. task_A failed today at 7:22 BST:
   <img width="1673" alt="Screenshot 2024-08-30 at 09 29 15" 
src="https://github.com/user-attachments/assets/8af892b7-13c1-40ad-b1cf-972a7c4c8841">
   2. One of its downstreams is in a None state at 7:23:00am BST
   <img width="1051" alt="Screenshot 2024-08-30 at 09 24 33" 
src="https://github.com/user-attachments/assets/84ab7cd9-b0e0-4e10-a6b5-57a60959b89d">
   3. Then the downstream is set to an `upstream_failed` state at 7:25am BST 
   <img width="1658" alt="Screenshot 2024-08-30 at 09 31 18" 
src="https://github.com/user-attachments/assets/7dde2d5b-1374-4753-a3bd-145c9bafcda0">
   
   
   ### What you think should happen instead?
   
   1. A task failed
   <Little to no wait>
   2. The task itself is marked as `FAILED`
   3. All downstream tasks are marked as `upstream_failed`
   
   We do not expect any delay in the task being marked with its appropriate 
state, nor in the marking of its downstream tasks.
   
   ### How to reproduce
   
   This is hard to reproduce because the metadata DB (task instance table) 
only ever stores the latest state of a task. To minimize production downtime 
we retry failed tasks immediately; they then succeed, so the first state is 
never stored. One option could be to compare insertion timestamps with task 
completion timestamps and look at the delay between them.
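   
   A rough sketch of that timestamp comparison, assuming both values are 
available as ISO 8601 strings (for example a task's `end_date` from the API 
and the time a state was first observed). The helper names are hypothetical.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def propagation_delay_seconds(task_end, state_seen):
    """Seconds between a task finishing and its state first being observed."""
    return (parse_ts(state_seen) - parse_ts(task_end)).total_seconds()


# Illustrative values at minute precision, based on the times reported above:
# task_A failed at 7:22 BST, its downstream became upstream_failed at 7:25 BST.
delay = propagation_delay_seconds(
    "2024-08-30T07:22:00+01:00", "2024-08-30T07:25:00+01:00"
)
```

   Logging this value for every task that finishes would show how often the 
delay occurs and how large it gets.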
   
   ### Operating System
   
   linux/arm64
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Deployment details
   
   We use this docker image: apache/airflow:2.9.3-python3.9
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

