ashb opened a new pull request #11147:
URL: https://github.com/apache/airflow/pull/11147


   The previous query generated SQL like this:
   
   ```
   WHERE (task_id = ? AND dag_id = ? AND execution_date = ?) OR (task_id = ? 
AND dag_id = ? AND execution_date = ?)
   ```
   
   Which is fine for one or maybe even 100 TIs, but when testing DAGs at
   extreme size (over 21k tasks!) this query was taking for ever (162s on
   Postgres, 172s on MySQL 5.7)
   
   By changing this query to this
   
   ```
   WHERE task_id IN (?,?) AND dag_id = ? AND execution_date = ?
   ```
   
   the time is reduced to 1s! (1.03s on Postgres, 1.19s on MySQL)
   
   Even on 100 tis the reduction is large, but the overall time is not
   significant (0.01451s -> 0.00626s on Postgres).
   
   Times included SQLA query construction time (but not time for calling
   filter_for_tis. So a like-for-like comparison), not just DB query time:
   
   ```python
   ipdb> start_filter_20k = time.monotonic(); result_filter_20k = 
session.query(TI).filter(tis_filter).all(); end_filter_20k = time.monotonic()
   ipdb> end_filter_20k - start_filter_20k
   172.30647455298458
   ipdb> in_filter = TI.dag_id == self.dag_id, TI.execution_date == 
self.execution_date, TI.task_id.in_([o.task_id for o in old_states.keys()]);
   ipdb> start_20k_custom = time.monotonic(); result_custom_20k = 
session.query(TI).filter(in_filter).all(); end_20k_custom = time.monotonic()
   ipdb> end_20k_custom - start_20k_custom
   1.1882996069907676
   ```
   
   I have also removed the check that was ensuring everything was of the
   same type (all TaskInstance or all TaskInstanceKey) as it felt needless
   - both types have the three required fields, so the "duck-typing"
     approach at runtime (crash if doesn't have the required property)+mypy
     checks felt Good Enough.
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)**
 for more information.
   In case of fundamental code change, Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals))
 is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party 
License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in 
[UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to