karakanb opened a new issue, #27100: URL: https://github.com/apache/airflow/issues/27100
### Apache Airflow version

Other Airflow 2 version (please specify below)

### What happened

I have been experiencing tasks getting stuck in the running state forever when the DB is unreachable during the scheduling period for some reason. My setup:

- Airflow v2.3.4 on Kubernetes
- Postgres as the DB + PgBouncer in front of it, managed by DigitalOcean

Here's how it happened:

- I have some hourly pipelines, all running various reporting tasks.
- Right around the beginning of the hour on Oct 5th, around 10AM UTC, the database became unreachable for some reason.
- The worker and scheduler instances started noticing the problem, the earliest at 10:00:07 UTC.
- The issue took ~2 minutes to resolve.
- Around that time, a DagRun was started for one of the hourly pipelines:

<img width="1489" alt="image" src="https://user-images.githubusercontent.com/16530606/196298566-f98d21f9-0e66-4563-94d8-2e59e5a5ef4d.png">

This task instance stayed stuck until we cleared it out 12 days later, today. I have all the logs from all the Airflow components, but there is no mention of this specific task in the logs. There are many failure notifications. I'll share all the logs below, and in there you'll see some `ops-tracker` instances, but **they are not from the instance that is stuck, they belong to a different pipeline**. There are no logs mentioning the task instance that was stuck, nothing at all.

In the end, the task instance got stuck in a "running" state for 12 days, and it prevented other dagruns from running because I intentionally set `AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG` to `1`.

The pipeline has the following default args:

```python
default_args = {
    "depends_on_past": False,
    "retries": 3,
    "retry_delay": timedelta(seconds=30),
    "trigger_rule": TriggerRule.NONE_FAILED,
    "execution_timeout": timedelta(hours=24),
}
```

The timeouts didn't help at all.

### What you think should happen instead

- The task should have either not been scheduled, or been marked as failed after some period.
- Scheduling a task should be an atomic operation, and once a task is scheduled it should be able to resolve itself, either as a failure or as a success.
- The execution timeout should have been able to catch this failure, at most after that amount of time.

### How to reproduce

I imagine this would be a very tricky setup to reproduce, but here's what I think happens:

- Run Airflow.
- Right at the beginning of scheduling, during that second, kill the DB connection.
- Hopefully, the same problem should occur.

### Operating System

Debian GNU/Linux 11 (bullseye)

### Versions of Apache Airflow Providers

```
apache-airflow-providers-amazon==5.0.0
apache-airflow-providers-celery==3.0.0
apache-airflow-providers-cncf-kubernetes==4.3.0
apache-airflow-providers-common-sql==1.1.0
apache-airflow-providers-docker==3.1.0
apache-airflow-providers-elasticsearch==4.2.0
apache-airflow-providers-ftp==3.1.0
apache-airflow-providers-google==8.3.0
apache-airflow-providers-grpc==3.0.0
apache-airflow-providers-hashicorp==3.1.0
apache-airflow-providers-http==4.0.0
apache-airflow-providers-imap==3.0.0
apache-airflow-providers-microsoft-azure==4.2.0
apache-airflow-providers-microsoft-mssql==2.0.1
apache-airflow-providers-mysql==3.2.0
apache-airflow-providers-odbc==3.1.1
apache-airflow-providers-postgres==5.2.0
apache-airflow-providers-redis==3.0.0
apache-airflow-providers-sendgrid==3.0.0
apache-airflow-providers-sftp==4.0.0
apache-airflow-providers-slack==5.1.0
apache-airflow-providers-snowflake==2.7.0
apache-airflow-providers-sqlite==3.2.0
apache-airflow-providers-ssh==2.3.0
```

### Deployment

Other 3rd-party Helm chart

### Deployment details

Nothing, just Kubernetes + DigitalOcean Postgres with PgBouncer managed by DigitalOcean.

### Anything else

This occurred on multiple pipelines on the same day at the same time.

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!
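As a stopgap while the root cause is investigated, a periodic watchdog can flag task instances that have sat in the `running` state longer than their expected runtime. The sketch below is not from the report: the table and column names (`task_instance` with `dag_id`, `task_id`, `state`, `start_date`) follow the Airflow metadata schema but should be double-checked against your version, and it runs against an in-memory SQLite stand-in purely for illustration — the real check would query the Postgres metadata DB.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def find_stuck_task_instances(conn, max_runtime=timedelta(hours=24)):
    """Return (dag_id, task_id, start_date) rows that have been in the
    'running' state longer than max_runtime."""
    cutoff = datetime.now(timezone.utc) - max_runtime
    # ISO-8601 strings in a uniform format compare correctly as text.
    return conn.execute(
        "SELECT dag_id, task_id, start_date FROM task_instance "
        "WHERE state = 'running' AND start_date < ?",
        (cutoff.isoformat(),),
    ).fetchall()

# Demo against an in-memory stand-in for the metadata DB.
# DAG/task names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance "
    "(dag_id TEXT, task_id TEXT, state TEXT, start_date TEXT)"
)
old = (datetime.now(timezone.utc) - timedelta(days=12)).isoformat()
now = datetime.now(timezone.utc).isoformat()
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?)",
    [
        ("hourly_reports", "extract", "running", old),   # stuck for 12 days
        ("hourly_reports", "load", "success", old),      # finished, ignored
        ("ops-tracker", "ping", "running", now),         # fresh, ignored
    ],
)
stuck = find_stuck_task_instances(conn)
print(stuck)  # only the 12-day-old running task is flagged
```

Flagged rows could then be alerted on, or cleared with the `airflow tasks clear` CLI, rather than waiting 12 days for someone to notice the blocked dagruns.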
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
