dirrao opened a new issue, #35675:
URL: https://github.com/apache/airflow/issues/35675

   ### Apache Airflow version
   
   main (development)
   
   ### What happened
   
   Schedulers race to adopt pods when a scheduler's heartbeat is delayed. The 
affected scheduler is still alive, not dead; its heartbeat is only late because 
of a network timeout, heavy processing, etc. This leaks slots in 
executor.running_tasks. Eventually the schedulers can no longer launch pods 
because executor.running_tasks=parallelism.
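
   The leak described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyExecutor`, `running`, `open_slots`, and `launch` are hypothetical names chosen for illustration, not Airflow's actual executor code, and the adoption step is reduced to copying task keys between two objects.

```python
class ToyExecutor:
    """Simplified stand-in for one scheduler's executor (not Airflow's real class)."""

    def __init__(self, parallelism: int) -> None:
        self.parallelism = parallelism
        self.running: set[str] = set()  # task keys this executor believes it owns

    @property
    def open_slots(self) -> int:
        return self.parallelism - len(self.running)

    def launch(self, task_key: str) -> bool:
        if self.open_slots <= 0:
            return False  # executor believes it is full
        self.running.add(task_key)
        return True


scheduler_a = ToyExecutor(parallelism=2)
scheduler_b = ToyExecutor(parallelism=2)

# Scheduler A launches two pods, then its heartbeat is delayed.
scheduler_a.launch("task-1")
scheduler_a.launch("task-2")

# Scheduler B treats A as dead and adopts both pods; they later finish under B.
scheduler_b.running.update(scheduler_a.running)
scheduler_b.running.clear()  # B observes completion and frees its own slots

# A is never told about the adoption, so its entries are never removed.
leaked_slots = len(scheduler_a.running)
can_launch_more = scheduler_a.launch("task-3")
print(leaked_slots, can_launch_more)
```

In this model scheduler A keeps counting two tasks it no longer owns, so its slot count stays pinned at `parallelism` and every further launch is refused.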
   
   ### What you think should happen instead
   
   We should remove the entry from the Kubernetes executor's running queue when 
the worker pod is deleted or moved to another scheduler.
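
   A minimal sketch of that behaviour, assuming the executor is notified when a pod is deleted or adopted elsewhere. The `release` hook and the `ToyExecutor` class are hypothetical illustrations, not the provider's real API or the eventual fix.

```python
class ToyExecutor:
    """Simplified stand-in for one scheduler's executor (not Airflow's real class)."""

    def __init__(self, parallelism: int) -> None:
        self.parallelism = parallelism
        self.running: set[str] = set()

    def launch(self, task_key: str) -> bool:
        if len(self.running) >= self.parallelism:
            return False
        self.running.add(task_key)
        return True

    def release(self, task_key: str) -> None:
        # Hypothetical hook: called when the worker pod is deleted or has been
        # adopted by another scheduler. discard() is a no-op for unknown keys,
        # so a duplicate notification is harmless.
        self.running.discard(task_key)


executor = ToyExecutor(parallelism=2)
executor.launch("task-1")
executor.launch("task-2")

blocked = executor.launch("task-3")          # refused: both slots are in use
executor.release("task-1")                   # pod moved to another scheduler
launched_after_release = executor.launch("task-3")
print(blocked, launched_after_release, sorted(executor.running))
```

Using `discard` rather than `remove` keeps the hook idempotent, which matters if both a delete event and an adoption event arrive for the same pod.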
   
   ### How to reproduce
   
   Reduce scheduler_health_check_threshold to 5 and 
orphaned_tasks_check_interval to 10 in the Airflow config file.
   Launch Airflow with two schedulers and schedule multiple DAGs with 
backfill, every 1/5 mins.
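
   As an alternative to editing the config file, the two settings can be supplied as environment variables. The sketch below assumes both options live in the `[scheduler]` section and relies on Airflow's `AIRFLOW__{SECTION}__{KEY}` naming convention; `airflow_env` is a hypothetical helper written for this example.

```python
def airflow_env(section: str, options: dict[str, int]) -> dict[str, str]:
    """Map config options to AIRFLOW__{SECTION}__{KEY} environment variable names."""
    return {
        f"AIRFLOW__{section.upper()}__{key.upper()}": str(value)
        for key, value in options.items()
    }


# Assumption: both options belong to the [scheduler] section.
repro_env = airflow_env(
    "scheduler",
    {
        "scheduler_health_check_threshold": 5,
        "orphaned_tasks_check_interval": 10,
    },
)

# Print shell export lines that could be sourced before starting each scheduler.
for name, value in repro_env.items():
    print(f"export {name}={value}")
```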
   
   ### Operating System
   
   CentOS 6
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-cncf-kubernetes=7.9.0
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Deployment details
   
   Terraform 
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
