NickYadance opened a new issue, #26237: URL: https://github.com/apache/airflow/issues/26237
### Apache Airflow version Other Airflow 2 version ### What happened Scheduler got crashed soon after reboot. ``` [SQL: UPDATE task_instance SET state=%s WHERE task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.run_id = %s AND task_instance.map_index = %s] [parameters: (<TaskInstanceState.SCHEDULED: 'scheduled'>, 'xx', 'xx', 'scheduled__2022-09-07T13:05:00+00:00', -1)] (Background on this error at: http://sqlalche.me/e/14/e3q8) [2022-09-08 07:50:45,270] {kubernetes_executor.py:813} INFO - Shutting down Kubernetes executor [2022-09-08 07:50:45,292] {manager.py:341} WARNING - Ending without manager process. [2022-09-08 07:50:45,292] {scheduler_job.py:768} INFO - Exited execute loop Traceback (most recent call last): File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1706, in _execute_context cursor, statement, parameters, context File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 716, in do_execute cursor.execute(statement, parameters) File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 206, in execute res = self._query(query) File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 319, in _query db.query(q) File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/connections.py", line 254, in query _mysql.connection.query(self, query) MySQLdb._exceptions.OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction') The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/usr/share/miniconda2/envs/airflow/bin/airflow", line 8, in <module> sys.exit(main()) File "/home/airflow/.local/lib/python3.7/site-packages/airflow/__main__.py", line 38, in main args.func(args) File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 51, in command return func(*args, **kwargs) File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/cli.py", line 99, in wrapper return f(*args, **kwargs) File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py", line 75, in scheduler _run_scheduler_job(args=args) File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py", line 46, in _run_scheduler_job job.run() File "/home/airflow/.local/lib/python3.7/site-packages/airflow/jobs/base_job.py", line 244, in run self._execute() File "/home/airflow/.local/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 730, in _execute self.executor.start() File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/celery_kubernetes_executor.py", line 82, in start self.kubernetes_executor.start() File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 539, in start self.clear_not_launched_queued_tasks() File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/session.py", line 71, in wrapper return func(*args, session=session, **kwargs) File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 515, in clear_not_launched_queued_tasks ).update({TaskInstance.state: State.SCHEDULED}) File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3196, in update execution_options={"synchronize_session": synchronize_session}, File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1670, in execute result = conn._execute_20(statement, params or {}, execution_options) File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1520, in _execute_20 return meth(self, args_10style, kwargs_10style, execution_options) File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 314, in _execute_on_connection self, multiparams, params, execution_options File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1399, in _execute_clauseelement cache_hit=cache_hit, File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1749, in _execute_context e, statement, parameters, cursor, context File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1930, in _handle_dbapi_exception sqlalchemy_exception, with_traceback=exc_info[2], from_=e File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_ raise exception File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1706, in _execute_context cursor, statement, parameters, context File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 716, in do_execute cursor.execute(statement, parameters) File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 206, in execute res = self._query(query) File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 319, in _query db.query(q) File "/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/connections.py", line 254, in query _mysql.connection.query(self, query) sqlalchemy.exc.OperationalError: (MySQLdb._exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') [SQL: UPDATE task_instance SET state=%s WHERE task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.run_id = %s AND task_instance.map_index = %s] [parameters: (<TaskInstanceState.SCHEDULED: 'scheduled'>, 'xx', 'xx', 'scheduled__2022-09-07T13:05:00+00:00', -1)] (Background on this error at: http://sqlalche.me/e/14/e3q8) ``` Database engine status shows that some select query is holding locks of the table `task_instance`, causing all the updates to that table hang and timeout. ``` LATEST DETECTED DEADLOCK ------------------------ 2022-09-08 06:31:35 139737395513088 *** (1) TRANSACTION: TRANSACTION 4413895189, ACTIVE 67 sec starting index read mysql tables in use 2, locked 2 LOCK WAIT 7 lock struct(s), heap size 1128, 3 row lock(s) MySQL thread id 1330145004, OS thread handle 139724943472384, query id 66415927240 10.244.2.0 airflow statistics SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, task_instance.run_id AS task_instance_run_id, task_instance.map_index AS task_instance_map_index, task_instance.start_date AS task_instance_start_date, task_instance.end_date AS task_instance_end_date, task_instance.duration AS task_instance_duration, task_instance.state AS task_instance_state, task_instance.max_tries AS task_instance_max_tries, task_instance.hostname AS task_instance_hostname, task_instance.unixname AS task_instance_unixname, task_instance.job_id AS task_instance_job_id, task_instance.pool AS task_instance_pool, task_instance.pool_slots AS task_instance_pool_slots, task_instance.queue AS task_instance_queue, task_instance.priority_weight AS task_instance_priority_weight, task_instance.operator AS task_instance_operator, task_in *** (1) HOLDS THE LOCK(S): RECORD LOCKS space id 522 page no 542731 n bits 120 index PRIMARY of table `airflow`.`task_instance` trx id 4413895189 lock_mode X locks rec but not gap Record lock, heap no 52 PHYSICAL RECORD: n_fields 29; compact format; info bits 0 0: len 30; hex 7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 bytes); 1: len 8; hex 747261696e5f4d59; asc xx;; 2: len 30; hex 7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc scheduled__2022-09-05T17:00:00; (total 36 bytes); 3: len 4; hex 7fffffff; asc ;; 4: len 6; hex 000107166d96; asc m ;; 5: len 7; hex 01000000dc188c; asc ;; 6: len 7; hex 631989a00430e4; asc c 0 ;; 7: SQL NULL; 8: SQL NULL; 9: len 7; hex 72756e6e696e67; asc running;; 10: len 4; hex 80000002; asc ;; 11: len 30; hex 73672d73686f7065652d7265636f6d6d656e642d7265636f6d6d656e642d; asc xx; (total 47 bytes); 12: len 4; hex 726f6f74; asc root;; 13: len 4; hex 804e4728; asc NG(;; 14: len 12; hex 64656661756c745f706f6f6c; asc default_pool;; 15: len 7; hex 64656661756c74; asc default;; 16: len 4; hex 80000003; asc ;; 17: len 12; hex 426173684f70657261746f72; asc BashOperator;; 18: len 7; hex 6319892f08ff4f; asc c / O;; 19: len 4; hex 80000060; asc `;; 20: len 4; hex 80000004; asc ;; 21: len 5; hex 80047d942e; asc } .;; 22: len 4; hex 80000001; asc ;; 23: len 4; hex 804e478a; asc NG ;; 24: SQL NULL; 25: SQL NULL; 26: SQL NULL; 27: SQL NULL; 28: SQL NULL; *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 520 page no 21486 n bits 112 index PRIMARY of table `airflow`.`dag_run` trx id 4413895189 lock_mode X locks rec but not gap waiting Record lock, heap no 45 PHYSICAL RECORD: n_fields 19; compact format; info bits 0 0: len 4; hex 8005cdf8; asc ;; 1: len 6; hex 00010716a9ca; asc ;; 2: len 7; hex 02000002b72669; asc &i;; 3: len 30; hex 7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 bytes); 4: len 7; hex 63162b10000000; asc c + ;; 5: len 6; hex 717565756564; asc queued;; 6: len 30; hex 7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc scheduled__2022-09-05T17:00:00; (total 36 bytes); 7: len 1; hex 80; asc ;; 8: len 5; hex 80047d942e; asc } .;; 9: SQL NULL; 10: SQL NULL; 11: len 9; hex 7363686564756c6564; asc scheduled;; 12: SQL NULL; 13: len 30; hex 316333653765323163363664396663356365326434373835343466323665; asc 1c3e7e21c66d9fc5ce2d478544f26e; (total 32 bytes); 14: len 4; hex 804ba58d; asc K ;; 15: len 7; hex 63198bfe0195b2; asc c ;; 16: len 7; hex 63162b10000000; asc c + ;; 17: len 7; hex 63163920000000; asc c 9 ;; 18: len 4; hex 80000001; asc ;; *** (2) TRANSACTION: TRANSACTION 4413893066, ACTIVE 73 sec starting index read mysql tables in use 1, locked 1 LOCK WAIT 7 lock struct(s), heap size 1128, 3 row lock(s), undo log entries 2 MySQL thread id 1329985529, OS thread handle 139729063040768, query id 66416312012 10.244.0.215 airflow updating UPDATE task_instance SET state='restarting' WHERE task_instance.task_id = 'xx' AND task_instance.dag_id = 'xx' AND task_instance.run_id = 'scheduled__2022-09-05T17:00:00+00:00' AND task_instance.map_index = -1 *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 520 page no 21486 n bits 112 index PRIMARY of table `airflow`.`dag_run` trx id 4413893066 lock_mode X locks rec but not gap Record lock, heap no 45 PHYSICAL RECORD: n_fields 19; compact format; info bits 0 0: len 4; hex 8005cdf8; asc ;; 1: len 6; hex 00010716a9ca; asc ;; 2: len 7; hex 02000002b72669; asc &i;; 3: len 30; hex 7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 bytes); 4: len 7; hex 63162b10000000; asc c + ;; 5: len 6; hex 717565756564; asc queued;; 6: len 30; hex 7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc scheduled__2022-09-05T17:00:00; (total 36 bytes); 7: len 1; hex 80; asc ;; 8: len 5; hex 80047d942e; asc } .;; 9: SQL NULL; 10: SQL NULL; 11: len 9; hex 7363686564756c6564; asc scheduled;; 12: SQL NULL; 13: len 30; hex 316333653765323163363664396663356365326434373835343466323665; asc 1c3e7e21c66d9fc5ce2d478544f26e; (total 32 bytes); 14: len 4; hex 804ba58d; asc K ;; 15: len 7; hex 63198bfe0195b2; asc c ;; 16: len 7; hex 63162b10000000; asc c + ;; 17: len 7; hex 63163920000000; asc c 9 ;; 18: len 4; hex 80000001; asc ;; *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 522 page no 542731 n bits 120 index PRIMARY of table `airflow`.`task_instance` trx id 4413893066 lock_mode X locks rec but not gap waiting Record lock, heap no 52 PHYSICAL RECORD: n_fields 29; compact format; info bits 0 0: len 30; hex 7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 bytes); 1: len 8; hex 747261696e5f4d59; asc xx;; 2: len 30; hex 7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc scheduled__2022-09-05T17:00:00; (total 36 bytes); 3: len 4; hex 7fffffff; asc ;; 4: len 6; hex 000107166d96; asc m ;; 5: len 7; hex 01000000dc188c; asc ;; 6: len 7; hex 631989a00430e4; asc c 0 ;; 7: SQL NULL; 8: SQL NULL; 9: len 7; hex 72756e6e696e67; asc running;; 10: len 4; hex 80000002; asc ;; 11: len 30; hex 73672d73686f7065652d7265636f6d6d656e642d7265636f6d6d656e642d; asc xx; (total 47 bytes); 12: len 4; hex 726f6f74; asc root;; 13: len 4; hex 804e4728; asc NG(;; 14: len 12; hex 64656661756c745f706f6f6c; asc default_pool;; 15: len 7; hex 64656661756c74; asc default;; 16: len 4; hex 80000003; asc ;; 17: len 12; hex 426173684f70657261746f72; asc BashOperator;; 18: len 7; hex 6319892f08ff4f; asc c / O;; 19: len 4; hex 80000060; asc `;; 20: len 4; hex 80000004; asc ;; 21: len 5; hex 80047d942e; asc } .;; 22: len 4; hex 80000001; asc ;; 23: len 4; hex 804e478a; asc NG ;; 24: SQL NULL; 25: SQL NULL; 26: SQL NULL; 27: SQL NULL; 28: SQL NULL; ``` Noticed that the query has the sign of `airflow statistics` so i tried shutdown statsd (delete k8s deployments) blindly and it surprisingly worked. After that the scheduler reboot and run normally. ### What you think should happen instead _No response_ ### How to reproduce _No response_ ### Operating System NAME="Ubuntu" VERSION="20.04.1 LTS (Focal Fossa)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 20.04.1 LTS" VERSION_ID="20.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=focal UBUNTU_CODENAME=focal ### Versions of Apache Airflow Providers _No response_ ### Deployment Official Apache Airflow Helm Chart ### Deployment details _No response_ ### Anything else _No response_ ### Are you willing to submit PR? - [X] Yes I am willing to submit a PR! ### Code of Conduct - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
