Raul824 opened a new issue, #35107:
URL: https://github.com/apache/airflow/issues/35107
### Apache Airflow version
Other Airflow 2 version (please specify below)
### What happened
Airflow 2.6.1
Running in Azure AKS.
KEDA autoscaling: 0-30 workers.
Worker concurrency: 16.
Backend DB: Postgres.
We are using Airflow to run jobs on Databricks via the Runs Submit API.
Our jobs are being killed mid-run because they are being marked as zombies.
Below is the cause I came up with after inspecting the details of a failed
job; it may be inaccurate, since this explanation was built purely by
observing the behavior of failed jobs.
Airflow sends a task to worker A, but Celery runs the same task on worker B.
Airflow then tries to get the status from worker A, which causes a heartbeat
miss, so the task is marked as a zombie and killed.
Below is the log from Airflow scheduler.
[2023-10-19T12:12:38.862+0000] {scheduler_job_runner.py:1683} WARNING -
Failing (1) jobs without heartbeat after 2023-10-19 12:07:38.854638+00:00
[2023-10-19T12:12:38.862+0000] {scheduler_job_runner.py:1693} ERROR -
Detected zombie job: {'full_filepath':
'/opt/airflow/dags/UDPPRDAU_ODS_KEY_SCD_SCF_11.py', 'processor_subdir':
'/opt/airflow/dags', 'msg': "{'DAG Id': 'UDPPRDAU_ODS_KEY_SCD_SCF_11', 'Task
Id': 'SSOT_DDS_ASSG_PROD_SCD.SSOT_DDS_ASSG_PROD_SCD', 'Run Id':
'manual__2023-10-17T12:59:00+00:00', 'Hostname':
'optusairflow-worker-7458876cdf-glk6z', 'External Executor Id':
'c5746e3a-c8f8-4596-a31d-132413d5591c'}", 'simple_task_instance':
<airflow.models.taskinstance.SimpleTaskInstance object at 0x7f4c99bb3790>,
'is_failure_callback': True}
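For context on the log above, the scheduler's zombie detection is essentially a heartbeat-age check: any running task whose job has not sent a heartbeat within the threshold (300 seconds by default, configurable via `[scheduler] scheduler_zombie_task_threshold`, which matches the 5-minute gap in the timestamps above) is failed as a zombie. A minimal illustrative sketch, with made-up names rather than Airflow's actual internals:

```python
from datetime import datetime, timedelta, timezone

# Simplified sketch of the heartbeat-based zombie check. The default
# threshold of 300s corresponds to [scheduler] scheduler_zombie_task_threshold;
# the function and dict shapes here are illustrative, not Airflow's API.
ZOMBIE_THRESHOLD = timedelta(seconds=300)

def find_zombies(running_tasks, now=None):
    """Return tasks whose last heartbeat is older than the threshold."""
    now = now or datetime.now(timezone.utc)
    limit = now - ZOMBIE_THRESHOLD
    return [t for t in running_tasks if t["latest_heartbeat"] < limit]

now = datetime(2023, 10, 19, 12, 12, 38, tzinfo=timezone.utc)
tasks = [
    {"task_id": "healthy", "latest_heartbeat": now - timedelta(seconds=60)},
    # A task whose worker pod was scaled away stops heartbeating entirely:
    {"task_id": "scaled_away", "latest_heartbeat": now - timedelta(seconds=400)},
]
print([t["task_id"] for t in find_zombies(tasks, now)])  # ['scaled_away']
```

This is why a worker pod terminated by KEDA mid-task produces exactly the log shown above: the heartbeat simply stops, and after the threshold elapses the scheduler fails the task.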
Below is the snippet from Celery showing the same external executor id
running on a different worker than the one mentioned in the Airflow log above.

This issue goes away if we pin the deployment at 10 always-on workers, but
then workers keep running even when there are no jobs, which increases cost.
Related Issue #35056
### What you think should happen instead
If Airflow dispatches a task to a specific worker, Celery should run it on
that same worker.
If a worker is about to be shut down due to autoscaling, Celery should mark
it in some draining state so that Airflow does not submit new tasks to it.
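As a possible mitigation in the meantime (not a fix for the routing mismatch itself), the worker termination grace period and the zombie threshold can both be raised so that a worker being scaled in has time to finish its in-flight tasks. A hedged `values.yaml` sketch for the official Helm chart; verify these keys against your chart version before use:

```yaml
workers:
  keda:
    enabled: true
    minReplicaCount: 0
    maxReplicaCount: 30
  # Give a draining worker time to finish in-flight tasks before the pod
  # is killed, instead of dying mid-task and producing zombies:
  terminationGracePeriodSeconds: 600
config:
  scheduler:
    # Tolerate longer heartbeat gaps before failing tasks as zombies
    # (default is 300 seconds):
    scheduler_zombie_task_threshold: 600
```

Raising the grace period only helps if the Celery worker actually stops accepting new tasks on SIGTERM and drains; it does not address tasks being routed to a different worker than the one Airflow recorded.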
### How to reproduce
Set worker scaling through KEDA from 0 to 10 and run more than 40 jobs.
### Operating System
Azure Kubernetes Services
### Versions of Apache Airflow Providers
2.6.1
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
_No response_
### Anything else
The occurrence of these failures is very high: almost 10 failed jobs per 30 runs.
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)