karakanb opened a new issue, #27100:
URL: https://github.com/apache/airflow/issues/27100

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### What happened
   
   I have been experiencing tasks getting stuck in the running state 
forever when the DB becomes unreachable during the scheduling period for some 
reason. 
   
   My setup:
   - Airflow v2.3.4 on Kubernetes
   - Postgres as the DB + pgbouncer in front of it, managed by DigitalOcean
   
   Here's how it happened:
   - I have some hourly pipelines, all running various reporting tasks.
   - Right around the beginning of the hour on Oct 5th, around 10AM UTC, the 
database became unreachable for some reason.
   - The worker and scheduler instances started noticing the problem, the 
earliest at 10:00:07 UTC.
   - The issue took ~2 minutes to resolve.
   - Around that time, a Dagrun was started for one of the hourly pipelines:
   <img width="1489" alt="image" 
src="https://user-images.githubusercontent.com/16530606/196298566-f98d21f9-0e66-4563-94d8-2e59e5a5ef4d.png">
   
   This task instance stayed stuck until we cleared it out 12 days later, 
today. 
   
   I have all the logs from all the Airflow components, but there is no mention 
of this specific task in them. There are many failure notifications. I'll 
share all the logs below, and in there you'll see some "ops-tracker" instances, 
but **they are not from the instance that is stuck, they belong to a different 
pipeline**. There are no logs mentioning the stuck task instance, 
nothing at all.
   
   In the end, the task instance got stuck in a "running" state for 12 days, 
and it prevented other dagruns from running because I intentionally set 
`AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG` to `1`.
   
   The pipeline has the following default args:
   ```python
   from datetime import timedelta

   from airflow.utils.trigger_rule import TriggerRule

   default_args = {
       "depends_on_past": False,
       "retries": 3,
       "retry_delay": timedelta(seconds=30),
       "trigger_rule": TriggerRule.NONE_FAILED,
       "execution_timeout": timedelta(hours=24),
   }
   ```
   
   The timeouts didn't help at all.
   
   ### What you think should happen instead
   
   - The task should have either not been scheduled, or been marked as failed 
after some period. 
   - Scheduling a task should be an atomic operation, and once a task is scheduled 
it should be able to resolve itself, either as a failure or as a success.
   - The execution timeout should have caught this failure at most after the 
configured duration.
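   As a workaround until the root cause is addressed, one could periodically flag 
task instances that have been "running" longer than their `execution_timeout`. A 
minimal sketch of the detection logic (the tuple shape and function name are 
illustrative, not Airflow's actual model; a real cleanup job would query the 
metadata DB instead):

   ```python
   from datetime import datetime, timedelta, timezone

   def find_stuck_tasks(task_instances, execution_timeout, now=None):
       """Return task_ids still 'running' past their execution_timeout.

       task_instances: iterable of (task_id, state, start_date) tuples,
       a stand-in for rows from Airflow's task_instance table.
       """
       now = now or datetime.now(timezone.utc)
       return [
           task_id
           for task_id, state, start_date in task_instances
           if state == "running" and now - start_date > execution_timeout
       ]
   ```

   Anything this returns could then be cleared (e.g. via `airflow tasks clear`) or 
marked failed, so a single stuck instance can't block dagruns for days.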
   
   ### How to reproduce
   
   I imagine this would be a very tricky setup to reproduce, but here's what I 
think happens:
   - Run Airflow
   - Right at the beginning of scheduling, during that second, kill the DB 
connection
   - Hopefully, the same problem should occur
   
   ### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==5.0.0
   apache-airflow-providers-celery==3.0.0
   apache-airflow-providers-cncf-kubernetes==4.3.0
   apache-airflow-providers-common-sql==1.1.0
   apache-airflow-providers-docker==3.1.0
   apache-airflow-providers-elasticsearch==4.2.0
   apache-airflow-providers-ftp==3.1.0
   apache-airflow-providers-google==8.3.0
   apache-airflow-providers-grpc==3.0.0
   apache-airflow-providers-hashicorp==3.1.0
   apache-airflow-providers-http==4.0.0
   apache-airflow-providers-imap==3.0.0
   apache-airflow-providers-microsoft-azure==4.2.0
   apache-airflow-providers-microsoft-mssql==2.0.1
   apache-airflow-providers-mysql==3.2.0
   apache-airflow-providers-odbc==3.1.1
   apache-airflow-providers-postgres==5.2.0
   apache-airflow-providers-redis==3.0.0
   apache-airflow-providers-sendgrid==3.0.0
   apache-airflow-providers-sftp==4.0.0
   apache-airflow-providers-slack==5.1.0
   apache-airflow-providers-snowflake==2.7.0
   apache-airflow-providers-sqlite==3.2.0
   apache-airflow-providers-ssh==2.3.0
   
   ### Deployment
   
   Other 3rd-party Helm chart
   
   ### Deployment details
   
   Nothing, just Kubernetes + DigitalOcean Postgres with PgBouncer managed by 
DigitalOcean.
   
   ### Anything else
   
   This occurred on multiple pipelines on the same day at the same time.
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
