[
https://issues.apache.org/jira/browse/AIRFLOW-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856341#comment-16856341
]
Yoichi Iwaki commented on AIRFLOW-4346:
---------------------------------------
[~vcastane]
Thank you for the information.
The problem is that the manifest file may let you set the access mode to
ReadWriteMany even if the underlying VolumePlugin (e.g. hostPath, GCS)
doesn't support ReadWriteMany. As you can see at lines 963-964 of the
attached log file, the KubernetesExecutor pods try to mount the PVC with
read_only=None, which means the pod mounts the PVC in ReadWrite mode.
Therefore, if your underlying VolumePlugin doesn't support ReadWriteMany
and you're running pods on multiple GKE nodes, this may be the cause of
the problem.
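To illustrate the point about read_only=None, here is a minimal sketch (not Airflow's actual code) of how an unset readOnly field on a volume mount is interpreted. In the Kubernetes API, omitting readOnly (i.e. leaving it None) falls back to the default of false, so the volume is mounted read-write; the helper name below is hypothetical.

```python
def effective_access(read_only):
    """Return the effective mount mode for a readOnly value of True/False/None.

    None (field omitted in the mount spec) falls back to the Kubernetes
    API default, which is false, i.e. a read-write mount.
    """
    return "ReadOnly" if read_only else "ReadWrite"

# A PVC mounted with read_only=None ends up read-write, which requires
# the underlying VolumePlugin to support write access from every node
# the executor pods land on.
print(effective_access(None))
print(effective_access(True))
```

This is why a PVC that claims ReadWriteMany in its manifest can still fail when the executor pods, mounting it read-write from multiple GKE nodes, hit a VolumePlugin that only supports single-node write access.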
> Kubernetes Executor Fails for Large Wide DAGs
> ---------------------------------------------
>
> Key: AIRFLOW-4346
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4346
> Project: Apache Airflow
> Issue Type: Bug
> Components: DAG, executors
> Affects Versions: 1.10.2, 1.10.3
> Reporter: Vincent Castaneda
> Priority: Blocker
> Labels: kubernetes
> Attachments: configmap-airflow-share.yaml, sched_logs.txt,
> wide_dag_bash_test.py, wide_dag_test_100_300.py, wide_dag_test_300_300.py
>
>
> When running large DAGs (those with a parallelism of over 100 task
> instances running concurrently), several tasks fail on the executor and
> are reported to the database, but the scheduler is never aware of the
> failures.
> Attached are:
> - A test DAG that we can use to replicate the issue.
> - The configmap-airflow.yaml file
> I will be available to answer any other questions about our
> configuration. We are running this on GKE, giving the scheduler and web
> pods a base of 100m CPU for execution.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)