[
https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930520#comment-16930520
]
Chris Wegrzyn commented on AIRFLOW-5447:
----------------------------------------
I'm afraid that's not our issue. We're using the helm/charts helm chart, which
has these permissions granted. Here's the rules section of the role bound to
the service account used by our pods (this is copied from actual deployed
values, just in case we drifted from the chart for whatever reason):
{code:java}
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - create
  - get
  - delete
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/log
  verbs:
  - get
  - list
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - create
  - get
{code}
For what it's worth, one relevant change I made from the default config was
overriding:
{code:java}
kube_client_request_args = {"_request_timeout" : [60,60] }{code}
I have it set to {"_request_timeout": null}. If left with a timeout, I get a
read timeout on the watch, which leads to "Unknown error in
KubernetesJobWatcher". I've kubectl exec'ed into the pod and used Python with
the Kubernetes Python client library to run a few calls such as
list_namespaced_pod, and they work fine. So it's not connectivity per se.
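
For reference, here is a rough sketch of how the two settings differ once parsed. It assumes the option is read as JSON and that a two-element list ends up as a (connect, read) timeout pair for the Kubernetes Python client, while null means no timeout at all. parse_request_args is a hypothetical helper written for illustration, not Airflow's actual code.

```python
import json


def parse_request_args(raw):
    """Hypothetical helper: turn a JSON config string into client kwargs."""
    args = json.loads(raw) if raw else {}
    timeout = args.get("_request_timeout")
    # Assumption: a JSON list [connect, read] is handed to the client as a
    # tuple; JSON null becomes None, which is read here as "no timeout".
    if isinstance(timeout, list):
        args["_request_timeout"] = tuple(timeout)
    return args


# The default value quoted above versus the override described in this comment.
default_args = parse_request_args('{"_request_timeout": [60, 60]}')
no_timeout_args = parse_request_args('{"_request_timeout": null}')
```

Under these assumptions, the default gives a long-lived watch a 60-second read timeout, whereas the null override lets the watch block for as long as the server keeps the connection open.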
In any event, even supposing that read timeout should NOT have happened, the
normal order of operations suggests that KubernetesExecutor#sync should call
AirflowKubernetesScheduler#sync, which should health-check the job watcher and
restart it. This does not appear to happen (which also reinforces the
appearance that some thread is hung).
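
To make the expected order of operations concrete, here is a minimal sketch of the health-check-and-restart pattern. Watcher and Supervisor are hypothetical stand-ins built on threading for illustration; they are not the Airflow classes named above, and the real watcher may be structured differently.

```python
import threading


class Watcher(threading.Thread):
    """Hypothetical stand-in for a job watcher that runs until it fails."""

    def __init__(self, fail_event):
        super().__init__(daemon=True)
        self.fail_event = fail_event

    def run(self):
        # Simulate watching until an unexpected error ends the loop.
        self.fail_event.wait()


class Supervisor:
    """Hypothetical stand-in for the component whose sync() owns the watcher."""

    def __init__(self):
        self.restarts = 0
        self.watcher = self._make_watcher()

    def _make_watcher(self):
        self.fail_event = threading.Event()
        watcher = Watcher(self.fail_event)
        watcher.start()
        return watcher

    def sync(self):
        # Health check: replace the watcher only if it has exited.
        if not self.watcher.is_alive():
            self.restarts += 1
            self.watcher = self._make_watcher()


supervisor = Supervisor()
supervisor.sync()                    # watcher is alive, nothing to do
supervisor.fail_event.set()          # simulate the watcher dying
supervisor.watcher.join(timeout=5)   # wait for it to exit
supervisor.sync()                    # dead watcher is replaced
```

A check of this kind only helps if sync() keeps being called, and it only catches a watcher that has exited. A watcher that is still alive but blocked would pass it, which would be consistent with a hung thread.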
> KubernetesExecutor hangs on task queueing
> -----------------------------------------
>
> Key: AIRFLOW-5447
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5447
> Project: Apache Airflow
> Issue Type: Bug
> Components: executor-kubernetes
> Affects Versions: 1.10.4, 1.10.5
> Environment: Kubernetes version v1.14.3, Airflow version 1.10.4-1.10.5
> Reporter: Henry Cohen
> Assignee: Daniel Imberman
> Priority: Blocker
>
> Starting in 1.10.4, and continuing in 1.10.5, when using the
> KubernetesExecutor, with the webserver and scheduler running in the
> kubernetes cluster, tasks are scheduled, but when added to the task queue,
> the executor process hangs indefinitely. Based on log messages, it appears to
> be stuck at this line
> https://github.com/apache/airflow/blob/v1-10-stable/airflow/contrib/executors/kubernetes_executor.py#L761