coenvd opened a new issue, #58667:
URL: https://github.com/apache/airflow/issues/58667

   ### Apache Airflow version
   
   3.1.3
   
   ### If "Other Airflow 2/3 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   Since our migration to Airflow 3.X.X we have seen stability issues in the areas the DAG processor is responsible for. Some setup info:
   
   - We opted for a standalone DAG processor, as this seems to be the way forward in Airflow 3.
   - We run Airflow with the Kubernetes executor, 2 schedulers, 2 webservers and 1 DAG processor.
   - We have about 20 Python DAG files that generate about 630 DAGs by looping over YAML configuration files (roughly as in the sketch after this list).
   - Our DAG folder is copied into the Docker image when we deploy our environment, so there is no DAG syncing or config change outside of deploying new images.
   - We use AWS RDS Postgres as our metadata database.
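
   For context, our generator files follow this pattern (a minimal sketch; the config location, dag_ids and task contents are illustrative, not our real code):

   ```python
   from pathlib import Path

   import yaml
   from airflow.sdk import DAG  # Airflow 3 authoring interface
   from airflow.providers.standard.operators.empty import EmptyOperator

   CONFIG_DIR = Path(__file__).parent / "configs"  # illustrative location

   for config_file in sorted(CONFIG_DIR.glob("*.yaml")):
       config = yaml.safe_load(config_file.read_text())
       with DAG(dag_id=config["dag_id"], schedule=config.get("schedule")) as dag:
           EmptyOperator(task_id="start")  # real tasks are built from the config
       # Expose each generated DAG at module level so the processor registers it.
       globals()[dag.dag_id] = dag
   ```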
   
   ### DAG serialization time
   Initially, after migrating from Airflow 2.11 to 3.X.X, we started getting DAG serialization timeouts, which we never had in Airflow 2. After optimizing all our DAG generators to avoid database calls at parse time (mainly by moving Variable.get calls into task runtime, see the sketch below), we now only get DAG serialization timeouts while we are in the middle of deploying a new revision. I guess that is acceptable, but because the DAG processor has no health endpoint to determine readiness, it remains ambiguous whether it is actually working as expected. Even if we accept this, we still see the stability issues below.
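
   The optimization looked roughly like this (a sketch with illustrative names; the variable key `api_base` is hypothetical):

   ```python
   from airflow.sdk import Variable, dag, task

   # Before: a module-level Variable.get runs on every parse loop and was
   # one of the DB calls that pushed us over the serialization timeout.
   # API_BASE = Variable.get("api_base")

   @dag(schedule=None)
   def example_dag():
       @task
       def use_variable():
           # After: the variable is resolved only when the task actually runs.
           api_base = Variable.get("api_base")
           print(api_base)

       use_variable()

   example_dag()
   ```

   For operator arguments, Jinja-templated fields like `{{ var.value.api_base }}` defer the lookup to runtime in the same way.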
   
   ### DAG clearing & scheduling after that
   Sometimes subsequent DAG runs are not scheduled after we manually clear a DAG run. We have to restart the environment or delete the DAG to fix this. We cannot find any logs whatsoever that indicate an issue here, but that may be because we only notice it 24-36 hours later, when dependent DAGs start to fail because the upstream DAG was never scheduled.
   
   ### DAG processor health checks
   I've tried many setting changes, but I cannot make the DAG processor show up as "healthy" in either the web UI overview or the CLI/API health check. I checked the database, which shows a healthy, running DagProcessorJob, albeit with a last heartbeat that varies between 10 and 90 seconds old. When I query the health endpoint, the DAG processor is always reported as unhealthy, even when its last heartbeat is less than 30 seconds old:
   
   ```
   > $ date && curl https://${HOSTNAME}/api/v2/monitor/health
   
   Mon Nov 24 05:29:08 PM CET 2025
   
{"metadatabase":{"status":"healthy"},"scheduler":{"status":"healthy","latest_scheduler_heartbeat":"2025-11-24T16:29:00.501971+00:00"},"triggerer":{"status":null,"latest_triggerer_heartbeat":null},"dag_processor":{"status":"unhealthy","latest_dag_processor_heartbeat":"2025-11-24T16:28:53.902526+00:00"}}
   
   ```
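
   At that moment the heartbeat was only about 14 seconds old (05:29:08 PM CET is 16:29:08 UTC), well under the 30 s default used for the scheduler check:

   ```python
   from datetime import datetime, timezone

   # Values taken from the output above.
   now = datetime(2025, 11, 24, 16, 29, 8, tzinfo=timezone.utc)
   heartbeat = datetime.fromisoformat("2025-11-24T16:28:53.902526+00:00")

   print((now - heartbeat).total_seconds())  # ~14.1 seconds
   ```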
   
   Also, the documentation states that the [DAG processor health status works the same as the scheduler's](https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html#checking-airflow-health-status), but that does not seem to be the case in the example above. It is also unclear to me how readiness & liveness checks for the DAG processor should be configured, since it does not expose a health endpoint of its own; the best we have come up with is a probe along the lines of the sketch below.
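
   This is the kind of liveness probe we are considering as a workaround (a sketch only: it reuses the API health endpoint shown above with a 150 s threshold mirroring our scheduler setting, both of which are our own choices):

   ```python
   #!/usr/bin/env python3
   """Exit 0/1 depending on the age of the DAG processor heartbeat."""
   import json
   import os
   import sys
   import urllib.request
   from datetime import datetime, timezone

   THRESHOLD_SECONDS = 150  # mirrors our SCHEDULER_HEALTH_CHECK_THRESHOLD

   url = f"https://{os.environ['HOSTNAME']}/api/v2/monitor/health"
   with urllib.request.urlopen(url, timeout=10) as resp:
       health = json.load(resp)

   heartbeat = health["dag_processor"]["latest_dag_processor_heartbeat"]
   if heartbeat is None:
       sys.exit(1)  # no DagProcessorJob heartbeat recorded at all

   age = (datetime.now(timezone.utc) - datetime.fromisoformat(heartbeat)).total_seconds()
   print(f"dag_processor heartbeat age: {age:.0f}s")
   sys.exit(0 if age <= THRESHOLD_SECONDS else 1)
   ```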
   
   FYI, currently we have the following config set to influence the DAG processor:

   ```
   # We expect this to loosen the internal health check of the DAG processor as well
   "AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD" = "150"
   "AIRFLOW__DAG_PROCESSOR__PARSING_PROCESSES"            = "8"
   "AIRFLOW__DAG_PROCESSOR__DAG_FILE_PROCESSOR_TIMEOUT"   = "120"
   ```
   
   ### What you think should happen instead?
   
   - DAGs should always be scheduled again after manual intervention (clearing DAG runs).
   - The DAG processor should allow some form of readiness & liveness checking.
   - The health check of the DAG processor should behave the same as the scheduler's in the UI/CLI.
   
   ### How to reproduce
   
   ```
   date && curl https://${HOSTNAME}/api/v2/monitor/health
   ```
   
   or 
   
   ```
   SELECT *
   FROM job
   WHERE job_type = 'DagProcessorJob'
   ORDER BY latest_heartbeat DESC
   LIMIT 100;
   ```
   
   ### Operating System
   
   Debian GNU/Linux 12 (bookworm)
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==9.16.0
   apache-airflow-providers-cncf-kubernetes==10.9.0
   apache-airflow-providers-common-compat==1.7.4
   apache-airflow-providers-common-io==1.6.3
   apache-airflow-providers-common-sql==1.28.1
   apache-airflow-providers-databricks==7.7.4
   apache-airflow-providers-fab==3.0.1
   apache-airflow-providers-google==18.1.0
   apache-airflow-providers-http==5.4.0
   apache-airflow-providers-microsoft-azure==12.8.0
   apache-airflow-providers-salesforce==5.11.3
   apache-airflow-providers-smtp==2.2.1
   apache-airflow-providers-standard==1.8.0
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Deployment details
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

