NickYadance opened a new issue, #26237:
URL: https://github.com/apache/airflow/issues/26237

   ### Apache Airflow version
   
   Other Airflow 2 version
   
   ### What happened
   
   Scheduler got crashed soon after reboot.
   ```
   [SQL: UPDATE task_instance SET state=%s WHERE task_instance.dag_id = %s AND 
task_instance.task_id = %s AND task_instance.run_id = %s AND 
task_instance.map_index = %s]
   [parameters: (<TaskInstanceState.SCHEDULED: 'scheduled'>, 'xx', 'xx', 
'scheduled__2022-09-07T13:05:00+00:00', -1)]
   (Background on this error at: http://sqlalche.me/e/14/e3q8)
   [2022-09-08 07:50:45,270] {kubernetes_executor.py:813} INFO - Shutting down 
Kubernetes executor
   [2022-09-08 07:50:45,292] {manager.py:341} WARNING - Ending without manager 
process.
   [2022-09-08 07:50:45,292] {scheduler_job.py:768} INFO - Exited execute loop
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", 
line 1706, in _execute_context
       cursor, statement, parameters, context
     File 
"/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/default.py",
 line 716, in do_execute
       cursor.execute(statement, parameters)
     File 
"/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 
206, in execute
       res = self._query(query)
     File 
"/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 
319, in _query
       db.query(q)
     File 
"/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/connections.py", line 
254, in query
       _mysql.connection.query(self, query)
   MySQLdb._exceptions.OperationalError: (1205, 'Lock wait timeout exceeded; 
try restarting transaction')
   
   The above exception was the direct cause of the following exception:
   
   Traceback (most recent call last):
     File "/usr/share/miniconda2/envs/airflow/bin/airflow", line 8, in <module>
       sys.exit(main())
     File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/__main__.py", line 
38, in main
       args.func(args)
     File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/cli_parser.py", 
line 51, in command
       return func(*args, **kwargs)
     File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/cli.py", line 
99, in wrapper
       return f(*args, **kwargs)
     File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py",
 line 75, in scheduler
       _run_scheduler_job(args=args)
     File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py",
 line 46, in _run_scheduler_job
       job.run()
     File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/jobs/base_job.py", 
line 244, in run
       self._execute()
     File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py",
 line 730, in _execute
       self.executor.start()
     File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/celery_kubernetes_executor.py",
 line 82, in start
       self.kubernetes_executor.start()
     File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py",
 line 539, in start
       self.clear_not_launched_queued_tasks()
     File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/session.py", 
line 71, in wrapper
       return func(*args, session=session, **kwargs)
     File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py",
 line 515, in clear_not_launched_queued_tasks
       ).update({TaskInstance.state: State.SCHEDULED})
     File 
"/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", 
line 3196, in update
       execution_options={"synchronize_session": synchronize_session},
     File 
"/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", 
line 1670, in execute
       result = conn._execute_20(statement, params or {}, execution_options)
     File 
"/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", 
line 1520, in _execute_20
       return meth(self, args_10style, kwargs_10style, execution_options)
     File 
"/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", 
line 314, in _execute_on_connection
       self, multiparams, params, execution_options
     File 
"/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", 
line 1399, in _execute_clauseelement
       cache_hit=cache_hit,
     File 
"/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", 
line 1749, in _execute_context
       e, statement, parameters, cursor, context
     File 
"/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", 
line 1930, in _handle_dbapi_exception
       sqlalchemy_exception, with_traceback=exc_info[2], from_=e
     File 
"/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", 
line 211, in raise_
       raise exception
     File 
"/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", 
line 1706, in _execute_context
       cursor, statement, parameters, context
     File 
"/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/default.py",
 line 716, in do_execute
       cursor.execute(statement, parameters)
     File 
"/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 
206, in execute
       res = self._query(query)
     File 
"/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/cursors.py", line 
319, in _query
       db.query(q)
     File 
"/home/airflow/.local/lib/python3.7/site-packages/MySQLdb/connections.py", line 
254, in query
       _mysql.connection.query(self, query)
   sqlalchemy.exc.OperationalError: (MySQLdb._exceptions.OperationalError) 
(1205, 'Lock wait timeout exceeded; try restarting transaction')
   [SQL: UPDATE task_instance SET state=%s WHERE task_instance.dag_id = %s AND 
task_instance.task_id = %s AND task_instance.run_id = %s AND 
task_instance.map_index = %s]
   [parameters: (<TaskInstanceState.SCHEDULED: 'scheduled'>, 'xx', 'xx', 
'scheduled__2022-09-07T13:05:00+00:00', -1)]
   (Background on this error at: http://sqlalche.me/e/14/e3q8)
   ```
   
   Database engine status shows that some select query is holding locks of the 
table `task_instance`, causing all the updates to that table hang and timeout. 
   
   ```
   LATEST DETECTED DEADLOCK
   ------------------------
   2022-09-08 06:31:35 139737395513088
   *** (1) TRANSACTION:
   TRANSACTION 4413895189, ACTIVE 67 sec starting index read
   mysql tables in use 2, locked 2
   LOCK WAIT 7 lock struct(s), heap size 1128, 3 row lock(s)
   MySQL thread id 1330145004, OS thread handle 139724943472384, query id 
66415927240 10.244.2.0 airflow statistics
   SELECT task_instance.try_number AS task_instance_try_number, 
task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS 
task_instance_dag_id, task_instance.run_id AS task_instance_run_id, 
task_instance.map_index AS task_instance_map_index, task_instance.start_date AS 
task_instance_start_date, task_instance.end_date AS task_instance_end_date, 
task_instance.duration AS task_instance_duration, task_instance.state AS 
task_instance_state, task_instance.max_tries AS task_instance_max_tries, 
task_instance.hostname AS task_instance_hostname, task_instance.unixname AS 
task_instance_unixname, task_instance.job_id AS task_instance_job_id, 
task_instance.pool AS task_instance_pool, task_instance.pool_slots AS 
task_instance_pool_slots, task_instance.queue AS task_instance_queue, 
task_instance.priority_weight AS task_instance_priority_weight, 
task_instance.operator AS task_instance_operator, task_in
   
   *** (1) HOLDS THE LOCK(S):
   RECORD LOCKS space id 522 page no 542731 n bits 120 index PRIMARY of table 
`airflow`.`task_instance` trx id 4413895189 lock_mode X locks rec but not gap
   Record lock, heap no 52 PHYSICAL RECORD: n_fields 29; compact format; info 
bits 0
    0: len 30; hex 
7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 
bytes);
    1: len 8; hex 747261696e5f4d59; asc xx;;
    2: len 30; hex 
7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc 
scheduled__2022-09-05T17:00:00; (total 36 bytes);
    3: len 4; hex 7fffffff; asc     ;;
    4: len 6; hex 000107166d96; asc     m ;;
    5: len 7; hex 01000000dc188c; asc        ;;
    6: len 7; hex 631989a00430e4; asc c    0 ;;
    7: SQL NULL;
    8: SQL NULL;
    9: len 7; hex 72756e6e696e67; asc running;;
    10: len 4; hex 80000002; asc     ;;
    11: len 30; hex 
73672d73686f7065652d7265636f6d6d656e642d7265636f6d6d656e642d; asc xx; (total 47 
bytes);
    12: len 4; hex 726f6f74; asc root;;
    13: len 4; hex 804e4728; asc  NG(;;
    14: len 12; hex 64656661756c745f706f6f6c; asc default_pool;;
    15: len 7; hex 64656661756c74; asc default;;
    16: len 4; hex 80000003; asc     ;;
    17: len 12; hex 426173684f70657261746f72; asc BashOperator;;
    18: len 7; hex 6319892f08ff4f; asc c  /  O;;
    19: len 4; hex 80000060; asc    `;;
    20: len 4; hex 80000004; asc     ;;
    21: len 5; hex 80047d942e; asc   } .;;
    22: len 4; hex 80000001; asc     ;;
    23: len 4; hex 804e478a; asc  NG ;;
    24: SQL NULL;
    25: SQL NULL;
    26: SQL NULL;
    27: SQL NULL;
    28: SQL NULL;
   
   
   *** (1) WAITING FOR THIS LOCK TO BE GRANTED:
   RECORD LOCKS space id 520 page no 21486 n bits 112 index PRIMARY of table 
`airflow`.`dag_run` trx id 4413895189 lock_mode X locks rec but not gap waiting
   Record lock, heap no 45 PHYSICAL RECORD: n_fields 19; compact format; info 
bits 0
    0: len 4; hex 8005cdf8; asc     ;;
    1: len 6; hex 00010716a9ca; asc       ;;
    2: len 7; hex 02000002b72669; asc      &i;;
    3: len 30; hex 
7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 
bytes);
    4: len 7; hex 63162b10000000; asc c +    ;;
    5: len 6; hex 717565756564; asc queued;;
    6: len 30; hex 
7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc 
scheduled__2022-09-05T17:00:00; (total 36 bytes);
    7: len 1; hex 80; asc  ;;
    8: len 5; hex 80047d942e; asc   } .;;
    9: SQL NULL;
    10: SQL NULL;
    11: len 9; hex 7363686564756c6564; asc scheduled;;
    12: SQL NULL;
    13: len 30; hex 
316333653765323163363664396663356365326434373835343466323665; asc 
1c3e7e21c66d9fc5ce2d478544f26e; (total 32 bytes);
    14: len 4; hex 804ba58d; asc  K  ;;
    15: len 7; hex 63198bfe0195b2; asc c      ;;
    16: len 7; hex 63162b10000000; asc c +    ;;
    17: len 7; hex 63163920000000; asc c 9    ;;
    18: len 4; hex 80000001; asc     ;;
   
   
   *** (2) TRANSACTION:
   TRANSACTION 4413893066, ACTIVE 73 sec starting index read
   mysql tables in use 1, locked 1
   LOCK WAIT 7 lock struct(s), heap size 1128, 3 row lock(s), undo log entries 2
   MySQL thread id 1329985529, OS thread handle 139729063040768, query id 
66416312012 10.244.0.215 airflow updating
   UPDATE task_instance SET state='restarting' WHERE task_instance.task_id = 
'xx' AND task_instance.dag_id = 'xx' AND task_instance.run_id = 
'scheduled__2022-09-05T17:00:00+00:00' AND task_instance.map_index = -1
   
   *** (2) HOLDS THE LOCK(S):
   RECORD LOCKS space id 520 page no 21486 n bits 112 index PRIMARY of table 
`airflow`.`dag_run` trx id 4413893066 lock_mode X locks rec but not gap
   Record lock, heap no 45 PHYSICAL RECORD: n_fields 19; compact format; info 
bits 0
    0: len 4; hex 8005cdf8; asc     ;;
    1: len 6; hex 00010716a9ca; asc       ;;
    2: len 7; hex 02000002b72669; asc      &i;;
    3: len 30; hex 
7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 
bytes);
    4: len 7; hex 63162b10000000; asc c +    ;;
    5: len 6; hex 717565756564; asc queued;;
    6: len 30; hex 
7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc 
scheduled__2022-09-05T17:00:00; (total 36 bytes);
    7: len 1; hex 80; asc  ;;
    8: len 5; hex 80047d942e; asc   } .;;
    9: SQL NULL;
    10: SQL NULL;
    11: len 9; hex 7363686564756c6564; asc scheduled;;
    12: SQL NULL;
    13: len 30; hex 
316333653765323163363664396663356365326434373835343466323665; asc 
1c3e7e21c66d9fc5ce2d478544f26e; (total 32 bytes);
    14: len 4; hex 804ba58d; asc  K  ;;
    15: len 7; hex 63198bfe0195b2; asc c      ;;
    16: len 7; hex 63162b10000000; asc c +    ;;
    17: len 7; hex 63163920000000; asc c 9    ;;
    18: len 4; hex 80000001; asc     ;;
   
   
   *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
   RECORD LOCKS space id 522 page no 542731 n bits 120 index PRIMARY of table 
`airflow`.`task_instance` trx id 4413893066 lock_mode X locks rec but not gap 
waiting
   Record lock, heap no 52 PHYSICAL RECORD: n_fields 29; compact format; info 
bits 0
    0: len 30; hex 
7064705f737061727365646e6e5f70647076355f6d756c74697461736b5f; asc xx; (total 64 
bytes);
    1: len 8; hex 747261696e5f4d59; asc xx;;
    2: len 30; hex 
7363686564756c65645f5f323032322d30392d30355431373a30303a3030; asc 
scheduled__2022-09-05T17:00:00; (total 36 bytes);
    3: len 4; hex 7fffffff; asc     ;;
    4: len 6; hex 000107166d96; asc     m ;;
    5: len 7; hex 01000000dc188c; asc        ;;
    6: len 7; hex 631989a00430e4; asc c    0 ;;
    7: SQL NULL;
    8: SQL NULL;
    9: len 7; hex 72756e6e696e67; asc running;;
    10: len 4; hex 80000002; asc     ;;
    11: len 30; hex 
73672d73686f7065652d7265636f6d6d656e642d7265636f6d6d656e642d; asc xx; (total 47 
bytes);
    12: len 4; hex 726f6f74; asc root;;
    13: len 4; hex 804e4728; asc  NG(;;
    14: len 12; hex 64656661756c745f706f6f6c; asc default_pool;;
    15: len 7; hex 64656661756c74; asc default;;
    16: len 4; hex 80000003; asc     ;;
    17: len 12; hex 426173684f70657261746f72; asc BashOperator;;
    18: len 7; hex 6319892f08ff4f; asc c  /  O;;
    19: len 4; hex 80000060; asc    `;;
    20: len 4; hex 80000004; asc     ;;
    21: len 5; hex 80047d942e; asc   } .;;
    22: len 4; hex 80000001; asc     ;;
    23: len 4; hex 804e478a; asc  NG ;;
    24: SQL NULL;
    25: SQL NULL;
    26: SQL NULL;
    27: SQL NULL;
    28: SQL NULL;
   ```
   
   Noticed that the query has the sign of `airflow statistics` so i tried 
shutdown statsd (delete k8s deployments) blindly and it surprisingly worked. 
After that the scheduler reboot and run normally.  
   
   ### What you think should happen instead
   
   _No response_
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   NAME="Ubuntu" VERSION="20.04.1 LTS (Focal Fossa)" ID=ubuntu ID_LIKE=debian 
PRETTY_NAME="Ubuntu 20.04.1 LTS" VERSION_ID="20.04" 
HOME_URL="https://www.ubuntu.com/"; SUPPORT_URL="https://help.ubuntu.com/"; 
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"; 
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy";
 VERSION_CODENAME=focal UBUNTU_CODENAME=focal
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to