wolfier opened a new issue, #42553: URL: https://github.com/apache/airflow/issues/42553
### Apache Airflow version

Other Airflow 2 version (please specify below)

### If "Other Airflow 2 version" selected, which one?

2.9.3

### What happened?

A task instance's LocalTaskJobRunner exited without updating the task instance state, which is expected; the task instance is then identified as a zombie.

The issue is that the task instance is identified as a zombie three times, and each time a TaskCallbackRequest is created.

Three callback requests were sent because the DagFileProcessorProcess did not parse the file in a timely manner. There were about 600 seconds between parses, which allowed the zombie detection operation to find the same task instance multiple times.

Eventually, when the source file was parsed, the DagFileProcessorProcess executed all the TaskCallbackRequests and moved the task from `running` to `up_for_retry` and then to `failed`.

This was previously reported in #31212

### What you think should happen instead?

Ideally, each task instance attempt should be identified as a zombie only once, regardless of how that is achieved.

I think the following options are viable:

* deduplicate the TaskCallbackRequest / DbCallbackRequest
* tie the TaskCallbackRequest to the task instance key and check it before the request is sent to the DatabaseCallbackSink

### How to reproduce

1. Increase min_file_process_interval to 600 seconds
2. Create zombies by causing the LocalTaskJobRunner to terminate (delete the PgBouncer)
3. Confirm the same task instance attempt is identified as a zombie multiple times between source file parsings
4. Confirm the DagFileProcessorProcess parses the file and that the on_failure_callback runs multiple times

### Operating System

n/a

### Versions of Apache Airflow Providers

_No response_

### Deployment

Astronomer

### Deployment details

_No response_

### Anything else?

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!
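
To illustrate the second option under "What you think should happen instead?", here is a minimal sketch of keying callback requests by task instance attempt and checking the key before forwarding. It does not use Airflow's real classes: `AttemptKey`, `ZombieCallback`, `DedupingSink`, `submit` and `mark_processed` are hypothetical stand-ins, and the in-memory set is an assumption (a real fix would likely need the check to live in the database).

```python
from dataclasses import dataclass
from typing import Callable, NamedTuple


class AttemptKey(NamedTuple):
    """Hypothetical stand-in for a task instance key (one per attempt)."""

    dag_id: str
    task_id: str
    run_id: str
    try_number: int
    map_index: int = -1


@dataclass(frozen=True)
class ZombieCallback:
    """Hypothetical stand-in for a TaskCallbackRequest."""

    full_filepath: str
    key: AttemptKey
    msg: str


class DedupingSink:
    """Forwards at most one pending callback per task instance attempt."""

    def __init__(self, send: Callable[[ZombieCallback], None]) -> None:
        self._send = send
        self._pending: set[AttemptKey] = set()

    def submit(self, request: ZombieCallback) -> bool:
        # Drop the request if one is already pending for this attempt.
        if request.key in self._pending:
            return False
        self._pending.add(request.key)
        self._send(request)
        return True

    def mark_processed(self, key: AttemptKey) -> None:
        # Called once the file processor has executed the callback.
        self._pending.discard(key)


# Simulate zombie detection finding the same attempt three times
# before the source file is parsed.
forwarded: list[ZombieCallback] = []
sink = DedupingSink(forwarded.append)
key = AttemptKey("example_dag", "example_task", "manual__1", try_number=1)

results = [
    sink.submit(ZombieCallback("/dags/example.py", key, "Detected as zombie"))
    for _ in range(3)
]
```

In this sketch only the first of the three detections is forwarded, so the `on_failure_callback` would run once for the attempt. Because `try_number` is part of the key, a later retry that also becomes a zombie would still produce its own callback.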
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
