[
https://issues.apache.org/jira/browse/AIRFLOW-4346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855458#comment-16855458
]
Yoichi Iwaki commented on AIRFLOW-4346:
---------------------------------------
[~vcastane]
It looks like you're using a PVC (PersistentVolumeClaim) for the DAGs volume in
your config. Does your underlying PV/PVC support ReadWriteMany or ReadOnlyMany?
You can check the table at the following URL:
https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes
If it doesn't, the pods created by the KubernetesExecutor can only be scheduled
on a single node. Considering that the max pods per node is limited to 100 in
GKE, this may be causing the problem.
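For reference, a DAGs volume that can be mounted by pods on multiple nodes needs a multi-node access mode such as ReadWriteMany. Here is a minimal PVC sketch; the claim name, size, and storageClassName are placeholders, and the backing storage (e.g. an NFS-backed PV) must actually support the requested mode:

```yaml
# Hypothetical PVC for a shared Airflow DAGs volume.
# ReadWriteMany lets worker pods on different nodes mount the same volume;
# it only works if the underlying PV's storage supports that access mode.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-dags          # placeholder name
spec:
  accessModes:
    - ReadWriteMany           # or ReadOnlyMany if DAGs are read-only in workers
  resources:
    requests:
      storage: 1Gi            # placeholder size
  storageClassName: nfs-client  # placeholder; must map to RWX-capable storage
```

If the PVC only supports ReadWriteOnce, every pod that mounts it must land on the node where the volume is attached, which matches the single-node scheduling behavior described above.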
Note:
On my 4 vCPU / 24 GB RAM VM environment, wide_dag_bash_test.py ran successfully.
> Kubernetes Executor Fails for Large Wide DAGs
> ---------------------------------------------
>
> Key: AIRFLOW-4346
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4346
> Project: Apache Airflow
> Issue Type: Bug
> Components: DAG, executors
> Affects Versions: 1.10.2, 1.10.3
> Reporter: Vincent Castaneda
> Priority: Blocker
> Labels: kubernetes
> Attachments: configmap-airflow-share.yaml, sched_logs.txt,
> wide_dag_bash_test.py, wide_dag_test_100_300.py, wide_dag_test_300_300.py
>
>
> When running large, wide DAGs (those with a parallelism of over 100 task
> instances running concurrently), several tasks fail on the executor and are
> reported to the database, but the scheduler is never aware of them failing.
> Attached are:
> - A test DAG that we can use to replicate the issue.
> - The configmap-airflow.yaml file
> I will be available to answer any other questions that are raised about our
> configuration. We are running this on GKE and giving the scheduler and web
> pods a base CPU request of 100m.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)