turbaszek commented on pull request #8227:
URL: https://github.com/apache/airflow/pull/8227#issuecomment-635886652
Ok, I did a quick test on 30k DagRun records.
```python
def make():
with create_session() as s:
for j, t in enumerate(DagRunType):
for i in range(10000):
now = days_ago(i)
run_id = f"{t.value}__{str(uuid.uuid4())}"
run = DagRun(
dag_id=f"test_{j}_{t.value}",
run_id=run_id,
execution_date=now,
run_type=t.value,
)
s.add(run)
s.commit()
if __name__ == '__main__':
make()
DR = DagRun
N = 100
def query(kwargs):
with create_session() as s:
s.query(DR).filter_by(**kwargs)
@timing(N)
@repeat(N)
def test(kwargs):
query(kwargs)
kwargs = (
("dag_id", "test_0_backfill"),
("run_id", "backfill__b7d81c6b-a460-4a47-97c0-b2d587a48b12"),
("state", "running"),
("run_type", "backfill")
)
for i in range(1, 5):
cs = combinations(kwargs, i)
for c in cs:
k = dict(c)
print(f"Case: {k.keys()}")
test(k)
```
Then for each filter keys combination I counted how many times given idex
yielded the minimum time, here is the result:
```json
"Index('dag_id_state_type', dag_id, _state, run_type),": 4,
"Index('dag_id_state', dag_id, _state),Index('dag_id_type', dag_id,
run_type),": 0,
"Index('dag_id_type', dag_id, run_type),": 4,
"Index('dag_id_state', dag_id, _state),": 7}
```
Of course this means that "if you use each combination of keys as often as
others" then use `Index('dag_id_state', dag_id, _state)` and for sure do not
use a double index, unless you remember to specify which index should be used
in query (easy to abuse).
Here are exact result:
https://docs.google.com/spreadsheets/d/102Y-u7Uy2YF_-f0y-sxhUIZzNTrfPPYQVqfjBmAp1Oo/edit?usp=sharing
And what you can see is that `Index('dag_id_state_type', dag_id, _state,
run_type),` wins sometimes but when it loose it's quite visible. And when
`Index('dag_id_type', dag_id, run_type)` wins it's not a spectacular victory.
Thus, I would say that we should keep `Index('dag_id_state', dag_id, _state)`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]