[ https://issues.apache.org/jira/browse/AIRFLOW-5660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952222#comment-16952222 ]
Aditya Vishwakarma commented on AIRFLOW-5660:
---------------------------------------------
[~dimberman] Can you tell me more about the scale testing? I am trying to run
very large DAGs, and a lot of them, and I am running into scaling issues in
prod. This is one of the few issues I have faced so far.
For example, the next one on my list is the DagFileProcessor, which can take
300-500 seconds to process a large DAG. I would love to be able to experiment
with a performance testing framework.
> Scheduler becomes unresponsive when processing large DAGs on kubernetes.
> ------------------------------------------------------------------------
>
> Key: AIRFLOW-5660
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5660
> Project: Apache Airflow
> Issue Type: Bug
> Components: executor-kubernetes
> Affects Versions: 1.10.5
> Reporter: Aditya Vishwakarma
> Assignee: Daniel Imberman
> Priority: Major
>
> For very large DAGs (10,000+ tasks) and high parallelism, the scheduling loop
> can take 5-10 minutes or more.
> It seems that the `_labels_to_key` function in kubernetes_executor loads all
> tasks with a given execution date into memory, and it does so for every task
> in progress. So, if 100 tasks of a DAG with 10,000 tasks are in progress, it
> will load a million tasks from the db on every tick of the scheduler.
> [https://github.com/apache/airflow/blob/caf1f264b845153b9a61b00b1a57acb7c320e743/airflow/contrib/executors/kubernetes_executor.py#L598]
> A quick fix is to search for the task in the db directly before falling back
> to a full scan. I can submit a PR for it.
> A proper fix requires persisting a mapping of (safe_dag_id, safe_task_id,
> dag_id, task_id, execution_date) somewhere, probably in the metadatabase.
>
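The quick fix described above can be sketched roughly as follows. This is only
an illustration under stated assumptions, not the actual kubernetes_executor
code: it uses an in-memory sqlite3 table instead of the Airflow metadata
database, and `make_safe_label`, `labels_to_key`, `build_demo_db`, the table
layout, and the demo date are all hypothetical names and values chosen for the
example. The idea is to try an exact lookup first (which succeeds whenever the
ids were already label-safe) and only scan every task of the execution date
when that lookup misses.

```python
import re
import sqlite3


def make_safe_label(value):
    # Hypothetical stand-in for the executor's label sanitizer: replace
    # characters outside [A-Za-z0-9_.-] and cap the length at 63 characters.
    return re.sub(r"[^A-Za-z0-9_.-]", "-", value)[:63]


def labels_to_key(conn, safe_dag_id, safe_task_id, execution_date):
    # Fast path: if the ids were already label-safe, the labels equal the
    # real ids, so one indexed-style lookup is enough.
    row = conn.execute(
        "SELECT dag_id, task_id FROM task_instance "
        "WHERE dag_id = ? AND task_id = ? AND execution_date = ?",
        (safe_dag_id, safe_task_id, execution_date),
    ).fetchone()
    if row is not None:
        return (row[0], row[1], execution_date), "direct"

    # Slow path: scan every task of that execution date and compare the
    # sanitized ids, as the full scan in the issue description does.
    cursor = conn.execute(
        "SELECT dag_id, task_id FROM task_instance WHERE execution_date = ?",
        (execution_date,),
    )
    for dag_id, task_id in cursor:
        if (make_safe_label(dag_id) == safe_dag_id
                and make_safe_label(task_id) == safe_task_id):
            return (dag_id, task_id, execution_date), "scan"
    return None, "miss"


def build_demo_db(execution_date="2019-10-15T00:00:00"):
    # Demo data only: 1,000 label-safe task ids plus one that needs sanitizing.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance "
        "(dag_id TEXT, task_id TEXT, execution_date TEXT)"
    )
    rows = [("big_dag", "task_%d" % i, execution_date) for i in range(1000)]
    rows.append(("big_dag", "odd task/name", execution_date))
    conn.executemany("INSERT INTO task_instance VALUES (?, ?, ?)", rows)
    return conn
```

With the demo data, a lookup for ("big_dag", "task_42") is answered by the
direct query, while ("big_dag", "odd-task-name") falls through to the scan.
The sketch does not handle two different ids that sanitize to the same label,
which is one reason the persisted mapping is the more complete fix.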