amoGLingle opened a new issue, #24909:
URL: https://github.com/apache/airflow/issues/24909
### Apache Airflow version
2.1.0
### What happened
We have been running Airflow 2.1.0 with Scheduler HA for about 8 months,
having upgraded from 1.8. Recently (in the last 3-4 months) we have encountered
a situation where the schedulers lock up with no tasks running.
Symptom:
No tasks get run - nothing running at all. Restarting the workers did not help.
Looking at the scheduler logs (syslog) on both schedulers, we saw numerous
entries like:
{code}
[root@af2-dod-prod-master1 centos]# cat /var/log/messages | grep "list index"
Mar 29 03:10:03 af2-dod-prod-master1 scl: list index out of range
Mar 29 03:10:05 af2-dod-prod-master1 scl: list index out of range
--
Mar 29 03:10:23 af2-dod-prod-master1 scl: [2022-03-29 03:10:23,672] {celery_executor.py:295} ERROR - Error sending Celery task: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: Timeout, PID: 15673 (Background on this error at: http://sqlalche.me/e/13/7s2a)
Mar 29 03:10:23 af2-dod-prod-master1 scl: Celery Task ID: TaskInstanceKey(dag_id='dod_dsp_audience_edge', task_id='emit_datamine_druid_delay_to_influxdb', execution_date=datetime.datetime(2022, 3, 28, 20, 0, tzinfo=Timezone('UTC')), try_number=1)
--
Mar 29 03:10:03 af2-dod-prod-master1 scl: [2022-03-29 03:10:03,639] {dagrun.py:429} ERROR - Marking run <DagRun dod_queue_execution_monitor_worker4 @ 2022-03-29 03:05:00+00:00: scheduled__2022-03-29T03:05:00+00:00, externally triggered: False> failed
Mar 29 03:10:03 af2-dod-prod-master1 scl: [2022-03-29 03:10:03,639] {dagrun.py:608} WARNING - Failed to record first_task_scheduling_delay metric:
Mar 29 03:10:03 af2-dod-prod-master1 scl: list index out of range
--
Mar 29 03:10:01 af2-dod-prod-master1 scl: [2022-03-29 03:10:01,631] {celery_executor.py:295} ERROR - Error sending Celery task: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: Timeout, PID: 15673 (Background on this error at: http://sqlalche.me/e/13/7s2a)
Mar 29 03:10:01 af2-dod-prod-master1 scl: Celery Task ID: TaskInstanceKey(dag_id='dod_sync_monitor', task_id='load_dod_sync_post_data', execution_date=datetime.datetime(2022, 3, 29, 3, 5, tzinfo=Timezone('UTC')), try_number=1)
{code}
This looks like a bug in Airflow or Celery: the documentation at
http://sqlalche.me/e/13/7s2a says this error occurs when an application
improperly ignores an exception raised during flush and does not roll the
transaction back. Further explanation is at
https://docs.sqlalchemy.org/en/13/faq/sessions.html#faq-session-rollback
A prior Airflow JIRA shows this has been seen before:
https://issues.apache.org/jira/browse/AIRFLOW-6202?jql=project%20%3D%20AIRFLOW%20AND%20text%20~%20%22This%20Session%27s%20transaction%20has%20been%20rolled%20back%20due%20to%20a%20previous%20exception%20during%20flush.%22
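For context, here is a minimal standalone sketch (not Airflow code; SQLAlchemy
1.3-style imports and an in-memory SQLite database, purely for illustration) of
the failure mode the linked FAQ describes: if a flush raises and the exception
is swallowed without calling Session.rollback(), every later use of that same
Session fails with exactly the message seen in the scheduler logs above.
{code}
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class Item(Base):
    __tablename__ = "item"
    id = Column(Integer, primary_key=True)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(Item(id=1))
session.commit()

try:
    # Duplicate primary key -> IntegrityError raised during flush.
    session.add(Item(id=1))
    session.flush()
except Exception:
    # Swallowing the error WITHOUT session.rollback() leaves the Session
    # in the broken state seen in the scheduler logs.
    pass

# Any further use of the same Session now raises:
#   "This Session's transaction has been rolled back due to a previous
#    exception during flush. To begin a new transaction with this Session,
#    first issue Session.rollback()."
session.query(Item).all()

# The fix, per the FAQ, is to call session.rollback() in the except block
# (or use a context manager that does so) before reusing the Session.
{code}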
We have encountered this issue three times in the past ~4 months: twice on the
PROD cluster and once on the QA cluster.
### What you think should happen instead
Schedulers should not hang due to a locked transaction; tasks should keep
executing.
As described above, and as the linked SQLAlchemy documentation points out,
there appears to be a point in the code where the transaction is not rolled
back when it should be (see the sketch below).
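For illustration only, this is the kind of defensive handling I would expect
around whatever call is timing out. The function and parameter names here are
hypothetical, not Airflow's actual API; it is just a sketch of the pattern.
{code}
# Hypothetical sketch, not the real Airflow code path: roll the ORM session
# back whenever submitting the task fails, so the Session stays usable and
# the scheduler loop can carry on instead of hitting the "transaction has
# been rolled back" error on every subsequent query.
def send_task_safely(session, send_fn):
    """send_fn is any callable that submits one Celery task (made-up name)."""
    try:
        result = send_fn()
        session.commit()
        return result
    except Exception:
        session.rollback()  # keep the Session usable for the next iteration
        raise
{code}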
### How to reproduce
I have not found a way to reproduce this; it happens during the normal course
of running DAGs.
### Operating System
CentOS Linux 7
### Versions of Apache Airflow Providers
{code}
prod-master1 centos]# pip list
apache-airflow                            2.1.0
apache-airflow-providers-apache-druid     2.0.0
apache-airflow-providers-apache-livy      2.0.0
apache-airflow-providers-cncf-kubernetes  2.0.0
apache-airflow-providers-ftp              1.1.0
apache-airflow-providers-http             2.0.0
apache-airflow-providers-imap             1.0.1
apache-airflow-providers-mysql            2.0.0
apache-airflow-providers-postgres         2.0.0
apache-airflow-providers-snowflake        2.1.0
apache-airflow-providers-sqlite           1.0.2
{code}
### Deployment
Other
### Deployment details
Manual deployment by hand, following the instructions on the Airflow website.
### Anything else
This seems to occur only once every few months, but when it does, our
production DAGs simply lock up. We have a monitoring DAG for each of our
queues; each runs a single small task that pushes to InfluxDB/Grafana, and
Grafana alerting pages us via PagerDuty when such lockups occur (or on other
issues as well, such as networking outages or task runners being down).
The logs above show the ERROR and a pointer to where the issue might be:
possibly a transaction not being rolled back when an exception is raised.
I hope this can be (or has already been) found and fixed.
Thank you.
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)