coenvd opened a new issue, #58667: URL: https://github.com/apache/airflow/issues/58667
### Apache Airflow version

3.1.3

### If "Other Airflow 2/3 version" selected, which one?

_No response_

### What happened?

Since our migration to Airflow 3.x we have seen stability issues in the tasks the DAG processor is responsible for. Some setup info:

- We opted for a standalone DAG processor, as this seems to be the way forward in Airflow 3.
- We run Airflow with the Kubernetes executor, 2 schedulers, 2 webservers and 1 DAG processor.
- We have about 20 Python DAG files that generate about 630 DAGs by looping through YAML configuration files.
- Our DAG folder is copied into the Docker image when we deploy our environment, so there is no DAG syncing and no config updates outside of deploying new images.
- We use AWS RDS Postgres as our metadata database.

### DAG serialization time

Initially, after migrating from Airflow 2.11 to 3.x, we started getting DAG serialization timeouts, which we never had in Airflow 2. After optimizing all our DAG generators to avoid database calls outside of the DAG run (mainly by moving `Variable.get` calls), we now only get DAG serialization timeouts while we are in the middle of deploying a new revision. I guess that's acceptable, but because the DAG processor has no health endpoint to determine readiness, it remains ambiguous whether it is actually working as expected. Even accepting this, we still see the stability issues below.

### DAG clearing & scheduling after that

Sometimes subsequent DAG runs are not scheduled after we manually clear a DAG run. We have to restart the environment, or delete the DAG, to fix this. We cannot find any logs whatsoever that indicate an issue here, but that may be because we don't notice the problem until 24-36 hours later, when dependent DAGs start to fail because the upstream DAG was never scheduled.

### DAG processor health checks

I've tried many setting changes, but I cannot make the DAG processor show up as "healthy" in the web UI overview or via the CLI health command. I checked the database, which shows a healthy and running `DagProcessorJob`, albeit with a last heartbeat age varying between 10 and 90 seconds. When I query the health endpoint, I always get a result that reports the DAG processor as unhealthy, even when the last heartbeat is less than 30 seconds old:

```
> $ date && curl https://${HOSTNAME}/api/v2/monitor/health
Mon Nov 24 05:29:08 PM CET 2025
{"metadatabase":{"status":"healthy"},"scheduler":{"status":"healthy","latest_scheduler_heartbeat":"2025-11-24T16:29:00.501971+00:00"},"triggerer":{"status":null,"latest_triggerer_heartbeat":null},"dag_processor":{"status":"unhealthy","latest_dag_processor_heartbeat":"2025-11-24T16:28:53.902526+00:00"}}
```

Also, the documentation states that the [DAG processor health status works the same as the scheduler's](https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html#checking-airflow-health-status), but that does not seem to be the case in the example above. It's unclear to me how readiness & liveness checks for the DAG processor should be configured, as it does not seem to expose a health endpoint of its own.
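The closest thing we have found to a probe-able check is the `airflow jobs check` CLI run inside the DAG processor container. A minimal sketch, assuming `DagProcessorJob` is an accepted `--job-type` choice on this version and that the probe runs on the same host the job heartbeats from:

```
# Exits 0 if the metadata DB shows a recently-heartbeating DagProcessorJob
# for this host, non-zero otherwise; could be wired into an exec-style
# liveness probe on the dag-processor container.
airflow jobs check --job-type DagProcessorJob --local
```

Note that this reads the `job` table rather than going through the API server, so it presumably applies the same heartbeat threshold that produces the "unhealthy" status above.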
FYI, we currently have the following config set to influence the DAG processor:

```
"AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD" = "150" # We expect this to loosen the internal health check of the DAG processor as well
"AIRFLOW__DAG_PROCESSOR__PARSING_PROCESSES" = "8"
"AIRFLOW__DAG_PROCESSOR__DAG_FILE_PROCESSOR_TIMEOUT" = "120"
```

### What you think should happen instead?

- DAGs should always be scheduled again after manual intervention (clearing DAG runs).
- The DAG processor should allow some form of readiness & liveness checking.
- The health check of the DAG processor should behave the same as the scheduler's in the UI/CLI.

### How to reproduce

```
date && curl https://${HOSTNAME}/api/v2/monitor/health
```

or

```
SELECT * FROM job WHERE job_type = 'DagProcessorJob' ORDER BY latest_heartbeat desc limit 100;
```

### Operating System

Debian GNU/Linux 12 (bookworm)

### Versions of Apache Airflow Providers

apache-airflow-providers-amazon==9.16.0
apache-airflow-providers-cncf-kubernetes==10.9.0
apache-airflow-providers-common-compat==1.7.4
apache-airflow-providers-common-io==1.6.3
apache-airflow-providers-common-sql==1.28.1
apache-airflow-providers-databricks==7.7.4
apache-airflow-providers-fab==3.0.1
apache-airflow-providers-google==18.1.0
apache-airflow-providers-http==5.4.0
apache-airflow-providers-microsoft-azure==12.8.0
apache-airflow-providers-salesforce==5.11.3
apache-airflow-providers-smtp==2.2.1
apache-airflow-providers-standard==1.8.0

### Deployment

Other Docker-based deployment

### Deployment details

_No response_

### Anything else?

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
