[ https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aditya Vishwakarma updated AIRFLOW-5660:
----------------------------------------
    Summary: Scheduler becomes unresponsive when processing large DAGs on kubernetes.  (was: Scheduler becomes responsive when processing large DAGs on kubernetes.)

> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> ------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5660
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor-kubernetes
>    Affects Versions: 1.10.5
>            Reporter: Aditya Vishwakarma
>            Assignee: Daniel Imberman
>            Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop
> can take 5-10 minutes or more.
> It seems that the `_labels_to_key` function in kubernetes_executor loads all
> tasks with a given execution date into memory, and it does this for every
> task in progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress,
> it loads 100 x 10,000 = a million tasks from the db on every tick of the
> scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to look up the task in the db directly before falling back to
> a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id,
> dag_id, task_id, execution_date) somewhere, probably in the metadata database.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
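
The quick fix proposed in the description (direct lookup first, full scan only as a fallback) can be illustrated with a minimal, self-contained sketch. This is not the Airflow implementation: `TaskInstance`, `TaskStore`, `make_safe_label`, and this `labels_to_key` are hypothetical stand-ins, and the in-memory store only imitates the metadata database so that the number of rows read can be counted.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a task instance row in the metadata database.
TaskInstance = namedtuple("TaskInstance", ["dag_id", "task_id", "execution_date"])


def make_safe_label(value):
    """Hypothetical lossy sanitizer: replace characters that are not
    letters, digits, '_', '.' or '-' and truncate to 63 characters."""
    return re.sub(r"[^A-Za-z0-9_.-]", "_", value)[:63]


class TaskStore:
    """Toy in-memory database that counts how many rows are read."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._by_key = {(r.dag_id, r.task_id, r.execution_date): r for r in self._rows}
        self.rows_read = 0

    def get(self, dag_id, task_id, execution_date):
        # Indexed point lookup: counted as a single read, hit or miss.
        self.rows_read += 1
        return self._by_key.get((dag_id, task_id, execution_date))

    def scan(self, execution_date):
        # Full scan: yields every row with the given execution date.
        for row in self._rows:
            if row.execution_date == execution_date:
                self.rows_read += 1
                yield row


def labels_to_key(store, labels):
    """Map pod labels back to a task instance.

    Fast path: assume the labels are the unmodified IDs and look the task
    up directly. Fallback: scan all tasks for the execution date and
    compare sanitized IDs, as the issue says the current code does.
    """
    execution_date = labels["execution_date"]
    hit = store.get(labels["dag_id"], labels["task_id"], execution_date)
    if hit is not None:
        return hit
    for row in store.scan(execution_date):
        if (make_safe_label(row.dag_id) == labels["dag_id"]
                and make_safe_label(row.task_id) == labels["task_id"]):
            return row
    return None
```

With this sketch, a task whose IDs needed no sanitizing costs one read per tick instead of one read per task in the DAG run; only tasks whose IDs were altered by sanitizing still pay for the scan. A sanitized label can coincide with another task's real ID, which is one reason the description treats the persisted mapping as the proper fix.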