wolfier opened a new issue, #42553:
URL: https://github.com/apache/airflow/issues/42553

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### If "Other Airflow 2 version" selected, which one?
   
   2.9.3
   
   ### What happened?
   
   A task instance's LocalTaskJobRunner exited without updating the task 
instance state. This is expected, and the task instance is then identified as 
a zombie.
   
   The issue is that the task instance was identified as a zombie three times, 
and each time a TaskCallbackRequest was created. Three callback requests were 
sent because the DagFileProcessorProcess did not parse the file in a timely 
manner: there were about 600 seconds between parses, which allowed zombie 
detection to find the same task instance multiple times.
   
   Eventually, when the source file was parsed, the DagFileProcessorProcess 
executed all the TaskCallbackRequests and moved the task from `running` to 
`up_for_retry` and then to `failed`.
   
   This was previously reported in #31212
   
   ### What you think should happen instead?
   
   Ideally, each task instance attempt should be identified as a zombie only 
once, regardless of how that is achieved.
   
   I think the following options are viable:
   * deduplicate the TaskCallbackRequest / DbCallbackRequest
   * tie the TaskCallbackRequest to the task instance key and check for an 
existing request before it is sent to the DatabaseCallbackSink 
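
The second option could look roughly like the sketch below. It is only an illustration of keying callback requests on the task instance attempt. `ZombieCallback`, `DedupingSink`, and the `TIKey` tuple are hypothetical stand-ins written for this example, not Airflow's actual classes or API.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

# Hypothetical stand-in for a task instance key:
# (dag_id, task_id, run_id, try_number, map_index)
TIKey = Tuple[str, str, str, int, int]


@dataclass(frozen=True)
class ZombieCallback:
    """Simplified stand-in for a TaskCallbackRequest (assumption, not the real class)."""

    full_filepath: str
    ti_key: TIKey
    msg: str


class DedupingSink:
    """Hypothetical wrapper that forwards at most one callback per task instance attempt."""

    def __init__(self) -> None:
        self._seen: Set[TIKey] = set()
        # Stands in for the requests the real sink would persist.
        self.sent: List[ZombieCallback] = []

    def send(self, request: ZombieCallback) -> bool:
        """Forward the request unless this attempt was already reported; return True if sent."""
        if request.ti_key in self._seen:
            return False
        self._seen.add(request.ti_key)
        self.sent.append(request)
        return True


sink = DedupingSink()
first_try = ZombieCallback(
    "/dags/example.py", ("example_dag", "example_task", "manual__run", 1, -1), "Detected as zombie"
)
# The same attempt is detected three times before the file is parsed.
results = [sink.send(first_try) for _ in range(3)]

# A later attempt has a different try_number, so it is still reported.
second_try = ZombieCallback(
    "/dags/example.py", ("example_dag", "example_task", "manual__run", 2, -1), "Detected as zombie"
)
results.append(sink.send(second_try))
```

An in-memory set like this would not survive a scheduler restart or cover multiple schedulers, so a real fix would more likely check for an existing pending request in the database before inserting a new one.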
   
   
   ### How to reproduce
   
   1. Increase min_file_process_interval to 600 seconds
   2. Create zombies by causing the LocalTaskJobRunner to terminate (delete the 
PgBouncer)
   3. Confirm the same task instance attempt is identified as a zombie 
multiple times between source file parsings
   4. Confirm the DagFileProcessorProcess parses the file and that the 
on_failure_callback runs multiple times
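
To make step 4 easier to observe, a callback along these lines could be attached as the task's `on_failure_callback`. This is a sketch and not part of the original report: the function and logger names are hypothetical, and it assumes the callback context exposes the task instance under the `ti` key with `dag_id`, `task_id`, `run_id`, and `try_number` attributes.

```python
import logging
from types import SimpleNamespace
from typing import Any, Dict

log = logging.getLogger("zombie_callback_probe")  # hypothetical logger name


def describe_attempt(context: Dict[str, Any]) -> str:
    """Build an identifier for the task instance attempt found in a callback context."""
    ti = context["ti"]  # assumption: the context carries the task instance as "ti"
    return f"{ti.dag_id}.{ti.task_id} run_id={ti.run_id} try_number={ti.try_number}"


def log_failure_callback(context: Dict[str, Any]) -> None:
    """Candidate on_failure_callback: emits one log line per invocation."""
    log.warning("on_failure_callback invoked for %s", describe_attempt(context))


# Local smoke test with a fake context; no Airflow installation is needed for this part.
fake_context = {
    "ti": SimpleNamespace(
        dag_id="example_dag", task_id="example_task", run_id="manual__run", try_number=1
    )
}
log_failure_callback(fake_context)
```

If the same identifier shows up more than once in the logs after the file is parsed, the callback ran multiple times for that task instance.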
   
   ### Operating System
   
   n/a
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Astronomer
   
   ### Deployment details
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
