kckoh commented on issue #62025:
URL: https://github.com/apache/airflow/issues/62025#issuecomment-3925235289
I also reproduced the 200k DagRuns on a single dag but i didn't get the slow
query as pierrejeambrun got on the screen recording. The response was in a
range between 300ms -1s. However, i looked into the query and it looks like :
`
SELECT
dr.*,
d.*,
drn.*
FROM dag_run dr
LEFT JOIN dag d
ON dr.dag_id = d.dag_id
LEFT JOIN dag_run_note drn
ON dr.id = drn.dag_run_id
WHERE dr.dag_id IN (<list_of_example_dag_ids>)
ORDER BY dr.run_after DESC, dr.id DESC
LIMIT 50 OFFSET 0;
`
The DagRun data model has several indexes
`
Index("dag_id_state", dag_id, _state),
Index("idx_dag_run_dag_id", dag_id),
Index("idx_dag_run_run_after", run_after),
Index(
"idx_dag_run_running_dags",
"state",
"dag_id",
postgresql_where=text("state='running'"),
sqlite_where=text("state='running'"),
),
# since mysql lacks filtered/partial indices, this creates a
# duplicate index on mysql. Not the end of the world
Index(
"idx_dag_run_queued_dags",
"state",
"dag_id",
postgresql_where=text("state='queued'"),
sqlite_where=text("state='queued'"),
),
`
In the frontend run_after is being passed as the query param for oder_by and
the id gets appended on the backend.
I propose that we could remove Index("idx_dag_run_run_after", run_after) and
add Index("idx_dag_run_run_after_id", run_after, id) to cover the order by for
run_after and the id.
Or, we could just add Index("idx_dag_run_run_after", run_after) and leave
the existing index.
What do you propose @pierrejeambrun ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]