amoGLingle opened a new issue, #24909:
URL: https://github.com/apache/airflow/issues/24909

   ### Apache Airflow version
   
   2.1.0
   
   ### What happened
   
   We have been running Airflow 2.1.0 with Scheduler HA for about 8 months, having upgraded from 1.8. Recently (in the last 3-4 months) we have encountered a situation where the schedulers lock up and no tasks run.
   
   Symptom:
   No tasks get run, nothing runs at all, and restarting the workers did not help.
   
   We looked at the scheduler logs (syslog) on both schedulers and saw numerous entries like:
   {code}
   [root@af2-dod-prod-master1 centos]# cat /var/log/messages | grep "list index"
   Mar 29 03:10:03 af2-dod-prod-master1 scl: list index out of range
   Mar 29 03:10:05 af2-dod-prod-master1 scl: list index out of range
   --
   Mar 29 03:10:23 af2-dod-prod-master1 scl: [2022-03-29 03:10:23,672] {celery_executor.py:295} ERROR - Error sending Celery task: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: Timeout, PID: 15673 (Background on this error at: http://sqlalche.me/e/13/7s2a)
   Mar 29 03:10:23 af2-dod-prod-master1 scl: Celery Task ID: TaskInstanceKey(dag_id='dod_dsp_audience_edge', task_id='emit_datamine_druid_delay_to_influxdb', execution_date=datetime.datetime(2022, 3, 28, 20, 0, tzinfo=Timezone('UTC')), try_number=1)
   --
   Mar 29 03:10:03 af2-dod-prod-master1 scl: [2022-03-29 03:10:03,639] {dagrun.py:429} ERROR - Marking run <DagRun dod_queue_execution_monitor_worker4 @ 2022-03-29 03:05:00+00:00: scheduled__2022-03-29T03:05:00+00:00, externally triggered: False> failed
   Mar 29 03:10:03 af2-dod-prod-master1 scl: [2022-03-29 03:10:03,639] {dagrun.py:608} WARNING - Failed to record first_task_scheduling_delay metric:
   Mar 29 03:10:03 af2-dod-prod-master1 scl: list index out of range
   --
   Mar 29 03:10:01 af2-dod-prod-master1 scl: [2022-03-29 03:10:01,631] {celery_executor.py:295} ERROR - Error sending Celery task: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: Timeout, PID: 15673 (Background on this error at: http://sqlalche.me/e/13/7s2a)
   Mar 29 03:10:01 af2-dod-prod-master1 scl: Celery Task ID: TaskInstanceKey(dag_id='dod_sync_monitor', task_id='load_dod_sync_post_data', execution_date=datetime.datetime(2022, 3, 29, 3, 5, tzinfo=Timezone('UTC')), try_number=1)
   {code}
   This looks like a bug in Airflow or Celery: the SQLAlchemy documentation at http://sqlalche.me/e/13/7s2a says this error occurs when an application ignores an exception raised during a flush and does not roll the session back. Further explanation is at https://docs.sqlalchemy.org/en/13/faq/sessions.html#faq-session-rollback
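
   For context, here is a minimal, self-contained sketch of the failure mode that FAQ describes (this is not Airflow code; the Item model, table name, and in-memory SQLite engine are made up for illustration): an exception raised during flush is swallowed, the session is reused without Session.rollback(), and every later use of the session then fails with the same "transaction has been rolled back" error seen in our logs.
   
   {code}
   # Illustrative only: reproduce the SQLAlchemy error outside Airflow.
   from sqlalchemy import Column, Integer, String, create_engine
   from sqlalchemy.ext.declarative import declarative_base
   from sqlalchemy.orm import sessionmaker
   
   Base = declarative_base()
   
   class Item(Base):
       __tablename__ = "item"
       id = Column(Integer, primary_key=True)
       name = Column(String, nullable=False)  # NOT NULL so a bad insert fails on flush
   
   engine = create_engine("sqlite://")
   Base.metadata.create_all(engine)
   session = sessionmaker(bind=engine)()
   
   try:
       session.add(Item(name=None))  # violates NOT NULL
       session.flush()               # raises IntegrityError
   except Exception:
       pass  # BUG: exception ignored, no session.rollback()
   
   # Any further use of the same session now fails with the error from the logs:
   # "This Session's transaction has been rolled back due to a previous exception
   # during flush. To begin a new transaction with this Session, first issue
   # Session.rollback()."
   try:
       session.query(Item).all()
   except Exception as exc:
       print(type(exc).__name__, exc)
   
   # The fix is to call session.rollback() in the except block before reusing
   # the session.
   {code}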
   
   A prior Airflow Jira ticket (AIRFLOW-6202) shows this has been seen before: https://issues.apache.org/jira/browse/AIRFLOW-6202?jql=project%20%3D%20AIRFLOW%20AND%20text%20~%20%22This%20Session%27s%20transaction%20has%20been%20rolled%20back%20due%20to%20a%20previous%20exception%20during%20flush.%22
   
   We have encountered this issue three times in the past ~4 months: twice on the PROD cluster and once on the QA cluster.
   
   
   ### What you think should happen instead
   
   Schedulers should not hang because of a locked transaction; tasks should keep executing.
   As the description above notes, and as the SQLAlchemy documentation it points to explains, there appears to be a place in the code where a failed transaction is not rolled back when it should be (see the sketch below).
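
   To illustrate the general pattern I mean (just a sketch, not Airflow's actual executor/scheduler code; run_db_step and do_work are placeholder names), any DB step that can fail mid-transaction should roll the session back before the session is reused:
   
   {code}
   # Sketch of the defensive pattern (illustrative only, not Airflow's code):
   # roll back on failure so one bad flush (e.g. a lock timeout) cannot poison
   # the shared session for every later scheduler iteration.
   import logging
   
   from sqlalchemy.exc import SQLAlchemyError
   
   log = logging.getLogger(__name__)
   
   def run_db_step(session, do_work):
       """Run one unit of DB work; roll back and re-raise on failure."""
       try:
           do_work(session)
           session.commit()
       except SQLAlchemyError:
           log.exception("DB step failed; rolling back so the session stays usable")
           # Without this rollback, the next use of the session raises
           # "This Session's transaction has been rolled back ...".
           session.rollback()
           raise
   {code}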
   
   ### How to reproduce
   
   I don't know how to reproduce it; it happens during the normal course of running DAGs.
   
   
   ### Operating System
   
   CentOS Linux 7
   
   ### Versions of Apache Airflow Providers
   
   prod-master1 centos]# pip list
   apache-airflow                           2.1.0
   apache-airflow-providers-apache-druid    2.0.0
   apache-airflow-providers-apache-livy     2.0.0
   apache-airflow-providers-cncf-kubernetes 2.0.0
   apache-airflow-providers-ftp             1.1.0
   apache-airflow-providers-http            2.0.0
   apache-airflow-providers-imap            1.0.1
   apache-airflow-providers-mysql           2.0.0
   apache-airflow-providers-postgres        2.0.0
   apache-airflow-providers-snowflake       2.1.0
   apache-airflow-providers-sqlite          1.0.2
   
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   Manual deployment by hand, following the instructions on the Airflow website.
   
   ### Anything else
   
   This seems to occur only once every few months, but when it does, our production DAGs just lock up. We have a monitoring DAG for each queue; each runs a single small task that pushes a heartbeat to InfluxDB/Grafana, and Grafana alerting pages us via PagerDuty when such lockups occur (or when other issues arise, such as network outages or task runners being down). A simplified example of one of these canary DAGs is shown below.
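
   Each canary is roughly the following (a sketch only; the InfluxDB URL, token, org, bucket, and queue name below are placeholders, not our real configuration):
   
   {code}
   # Simplified sketch of one per-queue "canary" DAG: a single small task pushes
   # a heartbeat point to InfluxDB, and Grafana alerts PagerDuty when heartbeats
   # stop arriving. Connection details are placeholders.
   from datetime import datetime
   
   from airflow import DAG
   from airflow.operators.python import PythonOperator
   from influxdb_client import InfluxDBClient, Point
   from influxdb_client.client.write_api import SYNCHRONOUS
   
   
   def push_heartbeat():
       client = InfluxDBClient(url="http://influxdb:8086", token="TOKEN", org="ORG")
       try:
           point = Point("airflow_queue_heartbeat").tag("queue", "worker4").field("alive", 1)
           client.write_api(write_options=SYNCHRONOUS).write(bucket="monitoring", record=point)
       finally:
           client.close()
   
   
   with DAG(
       dag_id="dod_queue_execution_monitor_worker4",  # one such DAG per queue
       start_date=datetime(2022, 1, 1),
       schedule_interval="*/5 * * * *",
       catchup=False,
   ) as dag:
       PythonOperator(
           task_id="push_heartbeat",
           python_callable=push_heartbeat,
           queue="worker4",  # run on the queue being monitored
       )
   {code}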
   
   The logs in the description above show the ERROR and point to where the issue may lie: a transaction that is possibly not rolled back after an exception.
   
   I hope this can be (or has already been) found and fixed.
   
   Thank You.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

