turbaszek commented on pull request #8227:
URL: https://github.com/apache/airflow/pull/8227#issuecomment-635886652


   Ok, I did a quick test on 30k DagRun records.
   
   ```python
   def make():
       with create_session() as s:
           for j, t in enumerate(DagRunType):
               for i in range(10000):
                   now = days_ago(i)
                   run_id = f"{t.value}__{str(uuid.uuid4())}"
                   run = DagRun(
                       dag_id=f"test_{j}_{t.value}",
                       run_id=run_id,
                       execution_date=now,
                       run_type=t.value,
                   )
                   s.add(run)
           s.commit()
   
   if __name__ == '__main__':
       make()
   
       DR = DagRun
       N = 100
   
       def query(kwargs):
           with create_session() as s:
               s.query(DR).filter_by(**kwargs)
   
       @timing(N)
       @repeat(N)
       def test(kwargs):
           query(kwargs)
   
   
       kwargs = (
           ("dag_id", "test_0_backfill"),
           ("run_id", "backfill__b7d81c6b-a460-4a47-97c0-b2d587a48b12"),
           ("state", "running"),
           ("run_type", "backfill")
       )
       for i in range(1, 5):
           cs = combinations(kwargs, i)
           for c in cs:
               k = dict(c)
               print(f"Case: {k.keys()}")
               test(k)
   ```
   
   Then for each filter keys combination I counted how many times given idex 
yielded the minimum time, here is the result:
   ```json
   "Index('dag_id_state_type', dag_id, _state, run_type),": 4,
   "Index('dag_id_state', dag_id, _state),Index('dag_id_type', dag_id, 
run_type),": 0,
   "Index('dag_id_type', dag_id, run_type),": 4,
   "Index('dag_id_state', dag_id, _state),": 7}
   ```
   
   Of course this means that "if you use each combination of keys as often as 
others" then use `Index('dag_id_state', dag_id, _state)` and for sure do not 
use a double index, unless you remember to specify which index should be used 
in query (easy to abuse).
   
   Here are exact result: 
https://docs.google.com/spreadsheets/d/102Y-u7Uy2YF_-f0y-sxhUIZzNTrfPPYQVqfjBmAp1Oo/edit?usp=sharing
   
   And what you can see is that `Index('dag_id_state_type', dag_id, _state, 
run_type),` wins sometimes but when it loose it's quite visible. And when 
`Index('dag_id_type', dag_id, run_type)` wins it's not a spectacular victory. 
   
   Thus, I would say that we should keep `Index('dag_id_state', dag_id, _state)`
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to