[
https://issues.apache.org/jira/browse/AIRFLOW-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855458#comment-16855458
]
Yoichi Iwaki commented on AIRFLOW-4346:
---------------------------------------
[~vcastane]
It looks like you're using a PVC (PersistentVolumeClaim) for the DAGs volume in
your config. Does your underlying PV/PVC support ReadWriteMany or ReadOnlyMany?
You can check the table at the following URL:
https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes
If it doesn't, the pods created by the KubernetesExecutor can only be scheduled
on a single node. Considering that the max pods per node is limited to 100 in
GKE, this may be causing the problem.
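For reference, a DAGs volume that can be mounted by pods on multiple nodes needs a multi-node access mode such as ReadWriteMany. Here is a minimal PVC sketch; the claim name, size, and storageClassName are placeholders, and the backing storage (e.g. an NFS-backed PV) must actually support the requested mode:

```yaml
# Hypothetical PVC for a shared Airflow DAGs volume.
# ReadWriteMany lets worker pods on different nodes mount the same volume;
# it only works if the underlying PV's storage supports that access mode.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-dags          # placeholder name
spec:
  accessModes:
    - ReadWriteMany           # or ReadOnlyMany if DAGs are read-only in workers
  resources:
    requests:
      storage: 1Gi            # placeholder size
  storageClassName: nfs-client  # placeholder; must map to RWX-capable storage
```

If the PVC only supports ReadWriteOnce, every pod that mounts it must land on the node where the volume is attached, which matches the single-node scheduling behavior described above.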
Note:
On my 4 vCPU / 24 GB RAM VM environment, wide_dag_bash_test.py ran successfully.
> Kubernetes Executor Fails for Large Wide DAGs
> ---------------------------------------------
>
> Key: AIRFLOW-4346
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4346
> Project: Apache Airflow
> Issue Type: Bug
> Components: DAG, executors
> Affects Versions: 1.10.2, 1.10.3
> Reporter: Vincent Castaneda
> Priority: Blocker
> Labels: kubernetes
> Attachments: configmap-airflow-share.yaml, sched_logs.txt,
> wide_dag_bash_test.py, wide_dag_test_100_300.py, wide_dag_test_300_300.py
>
>
> When running large, wide DAGs (those with a parallelism of over 100 task
> instances running concurrently), several tasks fail on the executor and are
> reported to the database, but the scheduler is never aware of them failing.
> Attached are:
> - A test DAG that we can use to replicate the issue.
> - The configmap-airflow.yaml file
> I will be available to answer any other questions that are raised about our
> configuration. We are running this on GKE and giving the scheduler and web
> pods a base CPU request of 100m.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)