liuning89757 commented on issue #27300:
URL: https://github.com/apache/airflow/issues/27300#issuecomment-1413361554

   Hi @potiuk @ashb @ephraimbuddy, I hope my findings offer some clues about this issue.
   I have hit a similar problem, almost every time a dynamic task expands to 30+ indexes (Airflow 2.4.3 and MariaDB 10.3.27).
   The traceback:
   ```
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/airflow/executors/local_executor.py", line 126, in _execute_work_in_fork
    args.func(args)
  File "/usr/local/lib/python3.9/site-packages/airflow/cli/cli_parser.py", line 52, in command
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/airflow/utils/cli.py", line 103, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 382, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/usr/local/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 189, in _run_task_by_selected_method
    _run_task_by_local_task_job(args, ti)
  File "/usr/local/lib/python3.9/site-packages/airflow/cli/commands/task_command.py", line 247, in _run_task_by_local_task_job
    run_job.run()
  File "/usr/local/lib/python3.9/site-packages/airflow/jobs/base_job.py", line 247, in run
    self._execute()
  File "/usr/local/lib/python3.9/site-packages/airflow/jobs/local_task_job.py", line 132, in _execute
    self.handle_task_exit(return_code)
  File "/usr/local/lib/python3.9/site-packages/airflow/jobs/local_task_job.py", line 163, in handle_task_exit
    self.task_instance.schedule_downstream_tasks()
  File "/usr/local/lib/python3.9/site-packages/airflow/utils/session.py", line 75, in wrapper
    return func(*args, session=session, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 2611, in schedule_downstream_tasks
    info = dag_run.task_instance_scheduling_decisions(session)
  File "/usr/local/lib/python3.9/site-packages/airflow/utils/session.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/airflow/models/dagrun.py", line 696, in task_instance_scheduling_decisions
    schedulable_tis, changed_tis, expansion_happened = self._get_ready_tis(
  File "/usr/local/lib/python3.9/site-packages/airflow/models/dagrun.py", line 755, in _get_ready_tis
    expanded_tis, _ = schedulable.task.expand_mapped_task(self.run_id, session=session)
  File "/usr/local/lib/python3.9/site-packages/airflow/models/mappedoperator.py", line 719, in expand_mapped_task
    session.flush()
  File "/usr/local/lib64/python3.9/site-packages/sqlalchemy/orm/session.py", line 3444, in flush
    self._flush(objects)
  File "/usr/local/lib64/python3.9/site-packages/sqlalchemy/orm/session.py", line 3584, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/local/lib64/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/usr/local/lib64/python3.9/site-packages/sqlalchemy/util/compat.py", line 210, in raise_
    raise exception
  File "/usr/local/lib64/python3.9/site-packages/sqlalchemy/orm/session.py", line 3544, in _flush
    flush_context.execute()
  File "/usr/local/lib64/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute
    rec.execute(self)
  File "/usr/local/lib64/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 630, in execute
    util.preloaded.orm_persistence.save_obj(
  File "/usr/local/lib64/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 237, in save_obj
    _emit_update_statements(
  File "/usr/local/lib64/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 1035, in _emit_update_statements
    raise orm_exc.StaleDataError(
sqlalchemy.orm.exc.StaleDataError: UPDATE statement on table 'task_instance' expected to update 1 row(s); 0 were matched.
   ```
   I turned on the general log to capture the SQL at the moment the exception happens, and found that two sessions (537138 and 535083) execute the same statement `UPDATE task_instance SET map_index=0 ... WHERE ... AND task_instance.map_index = -1`; 537138 commits, but 535083 rolls back and then raises the StaleDataError.
   ```
   ...
537138 Query    UPDATE task_instance SET state='removed' WHERE task_instance.dag_id = 'new_monitor_server' AND task_instance.task_id = 'update_server' AND task_instance.run_id = 'scheduled__2023-02-01T08:35:00+00:00' AND task_instance.map_index >= 37
537138 Query    UPDATE task_instance SET map_index=0 WHERE task_instance.dag_id = 'new_monitor_server' AND task_instance.task_id = 'update_server' AND task_instance.run_id = 'scheduled__2023-02-01T08:35:00+00:00' AND task_instance.map_index = -1
537138 Query    INSERT INTO task_instance (task_id, dag_id, run_id, map_index, start_date, end_date, duration, state, try_number, max_tries, hostname, unixname, job_id, pool, pool_slots, queue, priority_weight, operator, queued_dttm, queued_by_job_id, pid, executor_config, external_executor_id, trigger_id, trigger_timeout, next_method, next_kwargs) VALUES ('update_server', 'new_monitor_server', 'scheduled__2023-02-01T08:35:00+00:00', 1, NULL, NULL, NULL, NULL, 0, 0, '', 'root', NULL, 'default_pool', 1, 'default', 1, '_PythonDecoratedOperator', NULL, NULL, NULL, '€}”.', NULL, NULL, NULL, NULL, 'null'),('update_server', 'new_monitor_server', 'scheduled__2023-02-01T08:35:00+00:00', 2, NULL, NULL, NULL, NULL, 0, 0, '', 'root', NULL, 'default_pool', 1, 'default', 1, '_PythonDecoratedOperator', NULL, NULL, NULL, '€}”.', NULL, NULL, NULL, NULL, 'null'),('update_server', 'new_monitor_server', 'scheduled__2023-02-01T08:35:00+00:00', 3, NULL, NULL, NULL, NULL, 0, 0, '', 'root', NULL, 'default_pool', 1, 'default', 1,
......
535083 Query    UPDATE task_instance SET state='removed' WHERE task_instance.dag_id = 'new_monitor_server' AND task_instance.task_id = 'update_server' AND task_instance.run_id = 'scheduled__2023-02-01T08:35:00+00:00' AND task_instance.map_index >= 37
535083 Query    UPDATE dag_run SET last_scheduling_decision='2023-02-01 08:40:01.318630' WHERE dag_run.id = 4320
535083 Query    UPDATE task_instance SET map_index=0 WHERE task_instance.dag_id = 'new_monitor_server' AND task_instance.task_id = 'update_server' AND task_instance.run_id = 'scheduled__2023-02-01T08:35:00+00:00' AND task_instance.map_index = -1
535083 Query    ROLLBACK
535083 Query    ROLLBACK
   ```
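
   The same failure mode can be reproduced outside Airflow with a minimal sketch: a hypothetical `TI` model (standing in for `task_instance`, with `map_index` in the composite primary key) is loaded by two sessions, mimicking the two schedulers. Because the ORM's UPDATE targets the row by its old key value (`map_index = -1`), whichever session flushes second matches 0 rows and gets the same StaleDataError:

```python
# Minimal sketch (hypothetical model, NOT Airflow source) reproducing the
# StaleDataError: map_index is part of the primary key, so the ORM's UPDATE
# addresses the row via its old key value (map_index = -1).
import os
import tempfile

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base
from sqlalchemy.orm.exc import StaleDataError

Base = declarative_base()

class TI(Base):
    __tablename__ = "ti"
    task_id = Column(String, primary_key=True)
    map_index = Column(Integer, primary_key=True, default=-1)

engine = create_engine("sqlite:///" + os.path.join(tempfile.mkdtemp(), "ti.db"))
Base.metadata.create_all(engine)
with Session(engine) as setup:
    setup.add(TI(task_id="update_server", map_index=-1))
    setup.commit()

# Two sessions (like the two schedulers) load the same unmapped row.
s1, s2 = Session(engine), Session(engine)
row1 = s1.query(TI).filter(TI.map_index == -1).one()
row2 = s2.query(TI).filter(TI.map_index == -1).one()

row1.map_index = 0
s1.commit()  # wins: UPDATE ... WHERE map_index = -1 matches 1 row

row2.map_index = 0
caught = None
try:
    s2.flush()  # loses: the same UPDATE now matches 0 rows
except StaleDataError:
    caught = "StaleDataError"
    s2.rollback()
print(caught)
```

   (SQLite is used here only for portability; the locking behavior of the real fix of course depends on MariaDB/MySQL.)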
   
   After digging into the code in `airflow/models/mappedoperator.py`, function `expand_mapped_task`, I think there is a race condition in this section (maybe between the mini scheduler and the main scheduler?). The `unmapped_ti` query does not take a row lock with `with_for_update()`, so when two processes concurrently execute `unmapped_ti.map_index = 0`, the first one commits but the second one's `flush()` may raise this exception.
   ```
unmapped_ti: TaskInstance | None = (
    session.query(TaskInstance)
    .filter(
        TaskInstance.dag_id == self.dag_id,
        TaskInstance.task_id == self.task_id,
        TaskInstance.run_id == run_id,
        TaskInstance.map_index == -1,
        or_(TaskInstance.state.in_(State.unfinished), TaskInstance.state.is_(None)),
    )
    .one_or_none()
)

all_expanded_tis: list[TaskInstance] = []

if unmapped_ti:
    # The unmapped task instance still exists and is unfinished, i.e. we
    # haven't tried to run it before.
    if total_length is None:
        if self.dag and self.dag.partial:
            # If the DAG is partial, it's likely that the upstream tasks
            # are not done yet, so we do nothing
            indexes_to_map: Iterable[int] = ()
        else:
            # If the map length cannot be calculated (due to unavailable
            # upstream sources), fail the unmapped task.
            unmapped_ti.state = TaskInstanceState.UPSTREAM_FAILED
            indexes_to_map = ()
    elif total_length < 1:
        # If the upstream maps this to a zero-length value, simply mark
        # the unmapped task instance as SKIPPED (if needed).
        self.log.info(
            "Marking %s as SKIPPED since the map has %d values to expand",
            unmapped_ti,
            total_length,
        )
        unmapped_ti.state = TaskInstanceState.SKIPPED
        indexes_to_map = ()
    else:
        # Otherwise convert this into the first mapped index, and create
        # TaskInstance for other indexes.
        unmapped_ti.map_index = 0
        self.log.debug("Updated in place to become %s", unmapped_ti)
        all_expanded_tis.append(unmapped_ti)
        indexes_to_map = range(1, total_length)
   ```
   
   After changing the `unmapped_ti` query to use `with_for_update().one_or_none()`, the exception has not reappeared.
   

