[
https://issues.apache.org/jira/browse/AIRFLOW-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856341#comment-16856341
]
Yoichi Iwaki commented on AIRFLOW-4346:
---------------------------------------
[~vcastane]
Thank you for the information.
The problem is that the manifest file may let you set the access mode to
ReadWriteMany even if the underlying VolumePlugin (e.g. hostPath, GCS)
doesn't support ReadWriteMany. As you can see at lines 963-964 of the
attached log file, the KubernetesExecutor pods try to mount the PVC with
read_only=None, which means the pod mounts the PVC in ReadWrite mode.
Therefore, if your underlying VolumePlugin doesn't support ReadWriteMany
and you're running pods on multiple GKE nodes, this may be the cause of
the problem.
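To illustrate the point about read_only=None, here is a minimal sketch (not Airflow's actual code) of how an unset readOnly field on a volume mount is interpreted. In the Kubernetes API, omitting readOnly (i.e. leaving it None) falls back to the default of false, so the volume is mounted read-write; the helper name below is hypothetical.

```python
def effective_access(read_only):
    """Return the effective mount mode for a readOnly value of True/False/None.

    None (field omitted in the mount spec) falls back to the Kubernetes
    API default, which is false, i.e. a read-write mount.
    """
    return "ReadOnly" if read_only else "ReadWrite"

# A PVC mounted with read_only=None ends up read-write, which requires
# the underlying VolumePlugin to support write access from every node
# the executor pods land on.
print(effective_access(None))
print(effective_access(True))
```

This is why a PVC that claims ReadWriteMany in its manifest can still fail when the executor pods, mounting it read-write from multiple GKE nodes, hit a VolumePlugin that only supports single-node write access.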
> Kubernetes Executor Fails for Large Wide DAGs
> ---------------------------------------------
>
> Key: AIRFLOW-4346
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4346
> Project: Apache Airflow
> Issue Type: Bug
> Components: DAG, executors
> Affects Versions: 1.10.2, 1.10.3
> Reporter: Vincent Castaneda
> Priority: Blocker
> Labels: kubernetes
> Attachments: configmap-airflow-share.yaml, sched_logs.txt,
> wide_dag_bash_test.py, wide_dag_test_100_300.py, wide_dag_test_300_300.py
>
>
> When running large DAGs (those with a parallelism of over 100 task
> instances running concurrently), several tasks fail on the executor and
> are reported to the database, but the scheduler is never aware of the
> failures.
> Attached are:
> - A test DAG that we can use to replicate the issue.
> - The configmap-airflow.yaml file
> I will be available to answer any other questions about our
> configuration. We are running this on GKE, giving the scheduler and web
> pods a base of 100m CPU for execution.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)