[ 
https://issues.apache.org/jira/browse/AIRFLOW-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jarek Potiuk updated AIRFLOW-3797:
----------------------------------
    Labels: gsoc gsoc2020 mentor  (was: )

> Improve performance of cc1e65623dc7_add_max_tries_column_to_task_instance 
> migration
> -----------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-3797
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3797
>             Project: Apache Airflow
>          Issue Type: Improvement
>            Reporter: Bas Harenslak
>            Priority: Major
>              Labels: gsoc, gsoc2020, mentor
>
> The cc1e65623dc7_add_max_tries_column_to_task_instance migration creates a 
> new DagBag for every single task instance, re-parsing the corresponding DAG 
> each time. This is redundant and unnecessary.
> It has led to discussions on Slack like this one:
> {noformat}
> murquizo   [Jan 17th at 1:33 AM]
> Why does the airflow upgradedb command loop through all of the dags?
> ....
> murquizo   [14 days ago]
> NICE, @BasPH! that is exactly the migration that I was referring to.  We have 
> about 600k task instances and have a several python files that generate 
> multiple DAGs, so looping through all of the task_instances to update 
> max_tries was too slow.  It took 3 hours and didnt even complete! i pulled 
> the plug and manually executed the migration.   Thanks for your response.
> {noformat}
> An easy improvement is to parse each DAG only once and then set the 
> try_number for all of its task instances. I created a branch for this 
> (https://github.com/BasPH/incubator-airflow/tree/bash-optimise-db-upgrade), 
> am currently running tests, and will make a PR when done.
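The proposed optimization could be sketched as follows. This is an illustrative standalone sketch, not Airflow's actual migration code: the names `backfill_max_tries` and `load_dag` are hypothetical, `load_dag` stands in for `DagBag.get_dag`, and the `max_tries` formula shown is only an assumption about how the backfill combines a task's configured retries with the tries already used.

```python
# Hypothetical sketch: load each DAG once and reuse it for all of that
# DAG's task instances, instead of building a DagBag per row.
from collections import defaultdict


def backfill_max_tries(task_instances, load_dag):
    """Set max_tries on every task instance, parsing each DAG only once.

    task_instances: objects with .dag_id, .task_id, .try_number, .max_tries
    load_dag: callable dag_id -> dag object (or None), standing in for
              DagBag.get_dag; a dag object exposes a .tasks dict here.
    """
    # Group task instances by DAG so each DAG file is parsed a single time.
    grouped = defaultdict(list)
    for ti in task_instances:
        grouped[ti.dag_id].append(ti)

    for dag_id, tis in grouped.items():
        dag = load_dag(dag_id)  # one parse per DAG, not per task instance
        for ti in tis:
            task = dag.tasks.get(ti.task_id) if dag else None
            retries = task.retries if task else 0
            # Assumed formula: allow `retries` further attempts on top of
            # the tries already consumed by this task instance.
            ti.max_tries = ti.try_number + retries
```

With ~600k task instances spread over a few hundred DAGs, this turns hundreds of thousands of DAG parses into a few hundred, which is where the reported 3-hour runtime came from.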



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
