yuqian90 commented on a change in pull request #14048:
URL: https://github.com/apache/airflow/pull/14048#discussion_r572006019
##########
File path: airflow/models/taskinstance.py
##########
@@ -166,13 +167,23 @@ def clear_task_instances(
ti.max_tries = max(ti.max_tries, ti.prev_attempted_tries)
ti.state = State.NONE
session.merge(ti)
+
+ tr_filter.append((ti.dag_id, ti.task_id, ti.execution_date, ti.try_number))
+
+ if tr_filter:
# Clear all reschedules related to the ti to clear
- session.query(TR).filter(
- TR.dag_id == ti.dag_id,
- TR.task_id == ti.task_id,
- TR.execution_date == ti.execution_date,
- TR.try_number == ti.try_number,
- ).delete()
+ delete_qry = TR.__table__.delete().where(
+ or_(
+ and_(
+ TR.dag_id == dag_id,
+ TR.task_id == task_id,
+ TR.execution_date == execution_date,
+ TR.try_number == try_number,
+ )
+ for dag_id, task_id, execution_date, try_number in tr_filter
Review comment:
Thanks. That's a great idea. I looked into the optimization in
`filter_for_tis`. It works by extracting the common `dag_id` and
`execution_date`, and then filtering on `task_id` with an `IN` clause. This
greatly reduces the size of the query, but only when both `dag_id` and
`execution_date` are the same for all tis.
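
A minimal sketch of that shortcut, to make the limitation concrete. This is not the actual `filter_for_tis` code: the function name `common_key_filter`, the simplified key shape and the `?`-style placeholders are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical, simplified key: (dag_id, task_id, execution_date)
TiKey = Tuple[str, str, str]


def common_key_filter(keys: List[TiKey]) -> Optional[Tuple[str, list]]:
    """Build one compact WHERE clause when all keys share dag_id and execution_date.

    Returns None when the keys span more than one dag_id or execution_date,
    which is the case where this shortcut does not apply.
    """
    if not keys:
        return None
    dag_ids = {key[0] for key in keys}
    dates = {key[2] for key in keys}
    if len(dag_ids) != 1 or len(dates) != 1:
        return None
    # The shared columns become plain equality tests; only task_id varies.
    task_ids = sorted({key[1] for key in keys})
    marks = ", ".join("?" for _ in task_ids)
    clause = f"dag_id = ? AND execution_date = ? AND task_id IN ({marks})"
    return clause, [dag_ids.pop(), dates.pop(), *task_ids]
```

As soon as a second `execution_date` shows up in the keys, the function returns `None` and the caller would have to fall back to the long per-ti filter.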
In the slow `clear_task_instances` case, the large number of tis usually comes
from clearing a single `dag_id` across multiple `execution_date` values for a
large dag containing many `task_id`s. So it's slightly more complicated,
because there can be multiple `dag_id` and `execution_date` combinations (as
well as multiple `try_number` values).
So I updated the logic to be more generic. It now uses a nested dictionary
to build the filter hierarchically. Some profiling showed a large speedup
(roughly 40 to 50 times faster) compared to the first iteration, so overall
performance should now be about 300 times faster than the original for-loop
deletion.
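
The idea can be sketched roughly as follows. This is a standalone illustration using the standard-library `sqlite3` module rather than SQLAlchemy, and it is not the code in this PR: the helper name `build_hierarchical_filter`, the nesting order (`dag_id`, then `execution_date`, then `try_number`, then a set of `task_id`s) and the table layout are assumptions.

```python
import sqlite3
from collections import defaultdict
from typing import Iterable, List, Tuple

# (dag_id, task_id, execution_date, try_number), same order as tr_filter in the diff
TrKey = Tuple[str, str, str, int]


def build_hierarchical_filter(keys: Iterable[TrKey]) -> Tuple[str, List[object]]:
    """Group keys in a nested dict and emit one nested WHERE clause plus its params."""
    # Assumed nesting: dag_id -> execution_date -> try_number -> {task_id, ...}
    grouped = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for dag_id, task_id, execution_date, try_number in keys:
        grouped[dag_id][execution_date][try_number].add(task_id)

    params: List[object] = []
    dag_clauses = []
    for dag_id, by_date in grouped.items():
        params.append(dag_id)
        date_clauses = []
        for execution_date, by_try in by_date.items():
            params.append(execution_date)
            try_clauses = []
            for try_number, task_ids in by_try.items():
                ordered = sorted(task_ids)
                marks = ", ".join("?" for _ in ordered)
                try_clauses.append(f"(try_number = ? AND task_id IN ({marks}))")
                params.extend([try_number, *ordered])
            date_clauses.append(
                f"(execution_date = ? AND ({' OR '.join(try_clauses)}))"
            )
        dag_clauses.append(f"(dag_id = ? AND ({' OR '.join(date_clauses)}))")
    return " OR ".join(dag_clauses), params


# Usage: clear six reschedule rows across two dates, leave one unrelated row alone.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_reschedule "
    "(dag_id TEXT, task_id TEXT, execution_date TEXT, try_number INTEGER)"
)
to_clear = [
    ("example_dag", f"task_{i}", date, 1)
    for date in ("2021-01-01", "2021-01-02")
    for i in range(3)
]
keep = ("example_dag", "task_0", "2021-01-03", 1)
conn.executemany("INSERT INTO task_reschedule VALUES (?, ?, ?, ?)", to_clear + [keep])

where, params = build_hierarchical_filter(to_clear)
conn.execute(f"DELETE FROM task_reschedule WHERE {where}", params)
remaining = conn.execute("SELECT COUNT(*) FROM task_reschedule").fetchone()[0]
```

Each distinct `dag_id`, `execution_date` and `try_number` appears once per group instead of once per ti, so the statement grows with the number of groups plus the number of task ids rather than with four comparisons per ti.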