yuqian90 opened a new pull request #14048:
URL: https://github.com/apache/airflow/pull/14048
Clearing a large number of tasks takes a long time. More than 95% of that
time is spent at this line in `clear_task_instances`. This slowness
sometimes causes the webserver to time out because
`web_server_worker_timeout` is hit.
```python
# Clear all reschedules related to the ti to clear
session.query(TR).filter(
    TR.dag_id == ti.dag_id,
    TR.task_id == ti.task_id,
    TR.execution_date == ti.execution_date,
    TR.try_number == ti.try_number,
).delete()
```
This line is very slow because it runs inside a for loop and deletes
`TaskReschedule` rows one task instance at a time.
This PR simply changes this code to delete `TaskReschedule` rows in a single
SQL query with a bunch of `AND` conditions. The result is the same, but
it is now much faster. Simple profiling shows that it's at least seven times
faster when deleting thousands of `TaskReschedule` rows.
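As a rough illustration of the idea (not the code in this PR), the sketch below uses the standard-library `sqlite3` module instead of Airflow's SQLAlchemy session. The table layout, the helper name `delete_reschedules_batched`, and the `chunk_size` parameter are assumptions made for the example. Each task instance contributes one group of `AND` conditions, and the groups are joined with `OR` so that many rows are deleted per statement. The chunking is an extra precaution added here to stay within SQLite's limits on bound parameters and expression depth; it is not something the PR describes.

```python
import sqlite3


def delete_reschedules_batched(conn, keys, chunk_size=100):
    """Delete rows matching any key in `keys`, one DELETE per chunk.

    `keys` is an iterable of (dag_id, task_id, execution_date, try_number)
    tuples. Hypothetical helper, for illustration only.
    """
    keys = list(keys)
    # One AND group identifies the reschedule rows of one task instance.
    clause = (
        "(dag_id = ? AND task_id = ? "
        "AND execution_date = ? AND try_number = ?)"
    )
    deleted = 0
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        # Join the per-task-instance groups with OR into a single statement.
        where = " OR ".join([clause] * len(chunk))
        params = [value for key in chunk for value in key]
        cur = conn.execute(
            f"DELETE FROM task_reschedule WHERE {where}", params
        )
        deleted += cur.rowcount
    return deleted


# Small in-memory table standing in for the real TaskReschedule table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_reschedule ("
    "dag_id TEXT, task_id TEXT, execution_date TEXT, try_number INTEGER)"
)
rows = [("dag", f"task_{i}", "2021-02-03", 1) for i in range(250)]
conn.executemany("INSERT INTO task_reschedule VALUES (?, ?, ?, ?)", rows)

# Clear the reschedules of the first 200 task instances in 2 statements
# instead of 200.
deleted = delete_reschedules_batched(conn, rows[:200])
remaining = conn.execute("SELECT COUNT(*) FROM task_reschedule").fetchone()[0]
```

The speedup comes from cutting the number of round trips to the database, not from changing which rows are matched, so the set of deleted rows is the same as with the per-task loop.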