[ 
https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930520#comment-16930520
 ] 

Chris Wegrzyn commented on AIRFLOW-5447:
----------------------------------------

I'm afraid that's not our issue. We're using the Helm chart from helm/charts, 
which grants these permissions. Here is the rules section of the role bound to 
the service account used by our pods (copied from the actual deployed values, 
in case we drifted from the chart for whatever reason):

 
{code:java}
  rules:
  - apiGroups:
    - ""
    resources:
    - pods
    verbs:
    - create
    - get
    - delete
    - list
    - watch
  - apiGroups:
    - ""
    resources:
    - pods/log
    verbs:
    - get
    - list
  - apiGroups:
    - ""
    resources:
    - pods/exec
    verbs:
    - create
    - get
{code}
For what it's worth, one relevant change I made from the default config was 
overriding:
{code:java}
kube_client_request_args = {"_request_timeout": [60, 60]}
{code}
I have it set to {"_request_timeout": null}. If I leave a timeout in place, I 
get a read timeout on the watch, which leads to "Unknown error in 
KubernetesJobWatcher". I've kubectl exec'ed into the pod and used Python with 
the Kubernetes client library to run a few calls such as 
list_namespaced_pod, and they work fine. So it's not connectivity per se.
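
To make the difference between the two settings concrete, here is a minimal 
sketch of how each value looks once parsed. It assumes the setting is read as 
JSON and that a list is converted to a (connect, read) tuple before it reaches 
the client; parse_request_args is a hypothetical helper for illustration, not 
Airflow's actual code.

```python
import json


def parse_request_args(raw):
    """Parse a kube_client_request_args config string into client kwargs.

    Hypothetical helper: it only illustrates the JSON-to-kwargs step.
    """
    args = json.loads(raw) if raw else {}
    timeout = args.get("_request_timeout")
    # JSON has no tuple type, so a [connect, read] pair arrives as a list.
    # Assumption: the client expects the pair as a tuple.
    if isinstance(timeout, list):
        args["_request_timeout"] = tuple(timeout)
    return args


# Default from the config: 60 s to connect, 60 s between reads.
default_args = parse_request_args('{"_request_timeout": [60, 60]}')

# The override: JSON null becomes None, i.e. no client-side timeout at all.
no_timeout_args = parse_request_args('{"_request_timeout": null}')
```

With the override, a watch that receives no events never hits a read timeout 
on the client side, which is why the "Unknown error" goes away.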

In any event, even supposing that read timeout should NOT have happened, the 
normal order of operations suggests that KubernetesExecutor#sync should call 
AirflowKubernetesScheduler#sync, which should health-check the job watcher and 
restart it. This does not appear to happen (which also reinforces the 
impression that some thread is hung).
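
The expected health-check-and-restart behaviour can be sketched roughly as 
follows. This is an illustration only: WatcherSupervisor is a hypothetical 
class, and it uses a thread as a stand-in for the job watcher rather than 
whatever Airflow actually runs. It also shows why a liveness check alone 
would not catch a watcher that is blocked rather than dead.

```python
import threading


class WatcherSupervisor:
    """Hypothetical stand-in for a scheduler that supervises a watcher."""

    def __init__(self, target):
        self._target = target
        self.restarts = 0
        self.watcher = self._start()

    def _start(self):
        thread = threading.Thread(target=self._target, daemon=True)
        thread.start()
        return thread

    def sync(self):
        # Restart only when the watcher has exited. A watcher that is
        # blocked forever still reports is_alive() == True.
        if not self.watcher.is_alive():
            self.restarts += 1
            self.watcher = self._start()


def exiting_watch():
    # Simulates a watch loop that ends early, e.g. after a read timeout.
    return


# Case 1: the watcher died, so the next sync() restarts it.
supervisor = WatcherSupervisor(exiting_watch)
supervisor.watcher.join()  # wait for the simulated watcher to exit
supervisor.sync()

# Case 2: the watcher is blocked, so sync() sees it as healthy.
block = threading.Event()
hung = WatcherSupervisor(block.wait)
hung.sync()
block.set()  # release the blocked thread so the example exits cleanly
```

Note that sync() can only do its job if it is called at all; if the caller 
itself is stuck, neither case is ever checked.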

 

> KubernetesExecutor hangs on task queueing
> -----------------------------------------
>
>                 Key: AIRFLOW-5447
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5447
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor-kubernetes
>    Affects Versions: 1.10.4, 1.10.5
>         Environment: Kubernetes version v1.14.3, Airflow version 1.10.4-1.10.5
>            Reporter: Henry Cohen
>            Assignee: Daniel Imberman
>            Priority: Blocker
>
> Starting in 1.10.4, and continuing in 1.10.5, when using the 
> KubernetesExecutor, with the webserver and scheduler running in the 
> kubernetes cluster, tasks are scheduled, but when added to the task queue, 
> the executor process hangs indefinitely. Based on log messages, it appears to 
> be stuck at this line 
> https://github.com/apache/airflow/blob/v1-10-stable/airflow/contrib/executors/kubernetes_executor.py#L761



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
