[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932326#comment-16932326 ]

ASF subversion and git services commented on AIRFLOW-5447:
----------------------------------------------------------

Commit b88d4c5a7721c061a6488d519be1646ae0fdddea in airflow's branch refs/heads/v1-10-test from Daniel Imberman
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=b88d4c5 ]

[AIRFLOW-5447] Scheduler stalls because second watcher thread in default args

> KubernetesExecutor hangs on task queueing
> -----------------------------------------
>
>                 Key: AIRFLOW-5447
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5447
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor-kubernetes
>    Affects Versions: 1.10.4, 1.10.5
>        Environment: Kubernetes version v1.14.3, Airflow version 1.10.4-1.10.5
>           Reporter: Henry Cohen
>           Assignee: Daniel Imberman
>           Priority: Blocker
>
> Starting in 1.10.4, and continuing in 1.10.5, when using the KubernetesExecutor with the webserver and scheduler running in the Kubernetes cluster, tasks are scheduled, but when added to the task queue the executor process hangs indefinitely. Based on log messages, it appears to be stuck at this line:
> https://github.com/apache/airflow/blob/v1-10-stable/airflow/contrib/executors/kubernetes_executor.py#L761
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931826#comment-16931826 ]

Daniel Imberman commented on AIRFLOW-5447:
------------------------------------------

[~HPCohen] it's already in the 1-10-test branch. I'm going to work with [~kaxilnaik] and [~ash] to see if we can release 1.10.4/1.10.5 hotfixes, since this is such a critical bug.
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931779#comment-16931779 ]

ASF subversion and git services commented on AIRFLOW-5447:
----------------------------------------------------------

Commit 4fb29030a1f848f1cca64ad5558068509e94472e in airflow's branch refs/heads/v1-10-test from Daniel Imberman
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=4fb2903 ]

[AIRFLOW-5447] Scheduler stalls because second watcher thread in default args
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931681#comment-16931681 ]

Henry Cohen commented on AIRFLOW-5447:
--------------------------------------

Thank you guys so much for working on this. Any idea how long until the fix is out, now that it's been merged?
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931476#comment-16931476 ]

ASF subversion and git services commented on AIRFLOW-5447:
----------------------------------------------------------

Commit c098ff78508f71c038f901903f6ab587c3dc9b01 in airflow's branch refs/heads/master from Daniel Imberman
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=c098ff7 ]

[AIRFLOW-5447] Scheduler stalls because second watcher thread in default args
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931475#comment-16931475 ]

ASF GitHub Bot commented on AIRFLOW-5447:
-----------------------------------------

dimberman commented on pull request #6129: [AIRFLOW-5447] Scheduler stalls because second watcher thread in default args
URL: https://github.com/apache/airflow/pull/6129
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931309#comment-16931309 ]

Kaxil Naik commented on AIRFLOW-5447:
-------------------------------------

Wow, interesting!! Thank you so much, [~cwegrzyn] and everyone. This code has been around for ages; I'm wondering whether the issue has existed since the start or only since 1.10.4, [~dimberman]?
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931069#comment-16931069 ]

ASF GitHub Bot commented on AIRFLOW-5447:
-----------------------------------------

dimberman commented on pull request #6128: [AIRFLOW-5447] Don't set the executor with a default arg
URL: https://github.com/apache/airflow/pull/6128
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931046#comment-16931046 ]

ASF GitHub Bot commented on AIRFLOW-5447:
-----------------------------------------

dimberman commented on pull request #6129: [AIRFLOW-5447] Scheduler stalls because second watcher thread in default args
URL: https://github.com/apache/airflow/pull/6129
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931031#comment-16931031 ]

Daniel Imberman commented on AIRFLOW-5447:
------------------------------------------

Thank you so much for catching this, [~cwegrzyn]. Yeah, with two thread managers running, it would totally make sense for some sort of deadlock to occur. Let's confirm this and then hotpatch it into 1.10.4/5. cc: [~kaxilnaik] [~ash] [~schnie]
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930993#comment-16930993 ]

Chris Wegrzyn commented on AIRFLOW-5447:
----------------------------------------

Hmm... Confirmed. Dropping that airflow.www.app import from our plugin fixed the issue. I've pushed up a [PR|https://github.com/apache/airflow/pull/6128] that I believe resolves this issue. I'll see if I can create a minimal setup tomorrow where I can reproduce the problem, to confirm that this really is the cause.
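To make "dropping that import" concrete: the offending pattern is a plugin module that imports airflow.www.app (or anything else that transitively builds an executor) at module level. A hypothetical plugin sketch; AirflowPlugin is the real base class, the commented-out import is one common example from plugin tutorials, and the plugin name is made up:

{code:python}
from airflow.plugins_manager import AirflowPlugin
# from airflow.www.app import csrf   # an import like this runs at module load
#                                    # and can drag in the executor default-arg
#                                    # chain described in the PR summary below

class MyPlugin(AirflowPlugin):
    name = "my_plugin"
{code}

The plugin itself does nothing wrong at the point of use; the damage is done the moment the scheduler's plugin loader imports the file.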
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930992#comment-16930992 ]

ASF GitHub Bot commented on AIRFLOW-5447:
-----------------------------------------

cwegrzyn commented on pull request #6128: [AIRFLOW-5447] Don't set the executor with a default arg
URL: https://github.com/apache/airflow/pull/6128

From the PR description: The call to get a default executor was being executed at module import time. If this module is imported by a DAG, it can lead to two executors running in the scheduler, which in the case of the KubernetesExecutor can cause a deadlock. This defers creating the default executor to initialization.
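The import-time default argument pattern described in that PR summary is easy to demonstrate outside Airflow. A minimal sketch under assumed names (get_default_executor, SchedulerJobBuggy, and SchedulerJobFixed are hypothetical, not the actual Airflow code):

{code:python}
# The bug: a default argument is evaluated once, when the module is
# imported, not when the function is called. If a DAG or plugin imports
# this module, an executor gets built as a side effect of the import.

def get_default_executor():
    print("building an executor (for KubernetesExecutor this also spawns "
          "a multiprocessing manager and a watcher)")
    return object()  # stand-in for a real executor

class SchedulerJobBuggy:
    # get_default_executor() runs here, at import time
    def __init__(self, executor=get_default_executor()):
        self.executor = executor

class SchedulerJobFixed:
    def __init__(self, executor=None):
        # Deferred: nothing is built until a job is actually constructed
        # without an explicit executor.
        self.executor = executor or get_default_executor()
{code}

Importing the module containing SchedulerJobBuggy triggers the print (and, in the real scheduler, a second executor); the fixed variant has no import side effects.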
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930889#comment-16930889 ]

Chris Wegrzyn commented on AIRFLOW-5447:
----------------------------------------

I noticed the two thread manager servers earlier too, and I'm currently exploring the hypothesis that that is the problem. It's happening because a useless import triggers a default argument that creates an executor. It certainly seems like the sort of thing that could cause deadlocks if you have two of 'em. I got sidetracked, but I hope to take a look later tonight.
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930761#comment-16930761 ]

Daniel Imberman commented on AIRFLOW-5447:
------------------------------------------

OK, so I've broken down the currently running threads in the hope that this helps us out.

Thread 1: attempting to put a new task in the task_queue

{code:java}
Thread 0x7f0c13c7d700
  File "/usr/local/airflow/.local/bin/airflow", line 32, in <module>
    args.func(args)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/utils/cli.py", line 74, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/bin/cli.py", line 1013, in scheduler
    job.run()
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/jobs/base_job.py", line 213, in run
    self._execute()
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1350, in _execute
    self._execute_helper()
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1439, in _execute_helper
    self.executor.heartbeat()
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/executors/base_executor.py", line 132, in heartbeat
    self.trigger_tasks(open_slots)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/executors/base_executor.py", line 156, in trigger_tasks
    executor_config=simple_ti.executor_config)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 767, in execute_async
    self.task_queue.put((key, command, kube_executor_config))
  File "<string>", line 2, in put
  File "/usr/local/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "<string>", line 1, in <module>
  File "<string>", line 5, in <module>
{code}

Thread 2: re-reading plugins files

{code:java}
Thread 0x7f0c01c31700
  File "/usr/local/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)

Thread 0x7f0bff430700
  File "/usr/local/lib/python3.7/multiprocessing/managers.py", line 201, in handle_request
    result = func(c, *args, **kwds)
  File "/usr/local/lib/python3.7/multiprocessing/managers.py", line 422, in accept_connection
    self.serve_client(c)
  File "/usr/local/lib/python3.7/multiprocessing/managers.py", line 234, in serve_client
    request = recv()
  File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "<frozen importlib._bootstrap>", line 202, in _lock_unlock_module
  File "<frozen importlib._bootstrap>", line 98, in acquire
  File "/usr/local/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/multiprocessing/managers.py", line 178, in accepter
    c = self.listener.accept()
  File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 453, in accept
    c = self._listener.accept()
  File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 598, in accept
    s, self._last_accepted = self._socket.accept()
  File "/usr/local/lib/python3.7/socket.py", line 212, in accept
    fd, addr = self._accept()
  File "/usr/local/airflow/.local/bin/airflow", line 21, in <module>
    from airflow import configuration
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/__init__.py", line 94, in <module>
    operators._integrate_plugins()
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/operators/__init__.py", line 104, in _integrate_plugins
    from airflow.plugins_manager import operators_modules
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
{code}

Thread 3: thread manager server

{code:java}
Thread 0x7f0c13c7d700
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/airflow/.local/lib/python3
{code}
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930754#comment-16930754 ]

Daniel Imberman commented on AIRFLOW-5447:
------------------------------------------

[~cwegrzyn] this is nuts... It might be a race condition/failure in multiprocessing. How many tasks are you trying to launch at once? This is giving me a lot to investigate, btw, so thank you.
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930751#comment-16930751 ]

Daniel Imberman commented on AIRFLOW-5447:
------------------------------------------

Thank you [~cwegrzyn]! That helps massively. Let me look through. Also mentioning this issue in case there's any relation: https://issues.apache.org/jira/browse/AIRFLOW-5506
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930644#comment-16930644 ]

Chris Wegrzyn commented on AIRFLOW-5447:
----------------------------------------

After a bit of wrestling with pyrasite, and probably dumb luck, I managed to get what appears to be a telling stack trace:

{code:java}
Thread 0x7fb39d56d700
  File "/usr/local/airflow/.local/bin/airflow", line 32, in <module>
    args.func(args)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/utils/cli.py", line 74, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/bin/cli.py", line 1013, in scheduler
    job.run()
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/jobs/base_job.py", line 213, in run
    self._execute()
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1350, in _execute
    self._execute_helper()
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1439, in _execute_helper
    self.executor.heartbeat()
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/executors/base_executor.py", line 132, in heartbeat
    self.trigger_tasks(open_slots)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/executors/base_executor.py", line 156, in trigger_tasks
    executor_config=simple_ti.executor_config)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 767, in execute_async
    self.task_queue.put((key, command, kube_executor_config))
  File "<string>", line 2, in put
  File "/usr/local/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/local/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "<string>", line 1, in <module>
  File "<string>", line 5, in <module>
{code}

It does seem like something is going wrong with the communication related to the put to the task_queue.
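For reproducing this kind of dump: pyrasite injects an arbitrary Python payload into a running process, and a payload along these lines (standard library only; the thread-name mapping is a convenience, not part of pyrasite) prints a stack for every thread, which is roughly the format of the traces in this thread:

{code:python}
import sys
import threading
import traceback

# Map thread ids to names where we can, then dump every thread's stack.
names = {t.ident: t.name for t in threading.enumerate()}
for thread_id, frame in sys._current_frames().items():
    print("Thread %s (%s)" % (hex(thread_id), names.get(thread_id, "?")))
    print("".join(traceback.format_stack(frame)))
{code}

If injecting isn't an option, registering a handler with faulthandler.register(signal.SIGUSR1) at process start gives a similar all-threads dump on demand.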
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930520#comment-16930520 ]

Chris Wegrzyn commented on AIRFLOW-5447:
----------------------------------------

I'm afraid that's not our issue. We're using the helm/charts Helm chart, which has these permissions granted. Here's the rules section of the role bound to the service account used by our pods (copied from the actual deployed values, in case we drifted from the chart for whatever reason):

{code:yaml}
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - create
      - get
      - delete
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - pods/log
    verbs:
      - get
      - list
  - apiGroups:
      - ""
    resources:
      - pods/exec
    verbs:
      - create
      - get
{code}

For what it's worth, one relevant change I made from the default config was overriding:

{code}
kube_client_request_args = {"_request_timeout" : [60,60] }
{code}

I have it set to {"_request_timeout": null}. If left with a timeout, I get a read timeout on the watch, which leads to "Unknown error in KubernetesJobWatcher". I've kubectl exec'ed into the pod and used Python and the Python kubernetes client library to run a few calls like list_namespaced_pod, and it works fine. So it's not connectivity per se.

In any event, even supposing that read timeout should NOT have happened, the normal order of operations suggests that KubernetesExecutor#sync should call AirflowKubernetesScheduler#sync, which should health-check the job watcher and restart it. This does not appear to happen (which also reinforces the appearance that some thread is hung).
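The connectivity check described above presumably looked something like the following; this is a sketch, with the "airflow" namespace as a placeholder, and it assumes the official kubernetes Python client (whose method is list_namespaced_pod):

{code:python}
from kubernetes import client, config

# Inside a pod, credentials come from the mounted service account token.
config.load_incluster_config()

v1 = client.CoreV1Api()
# A broken network path or missing RBAC verb would surface here as an
# ApiException or a timeout rather than a pod listing.
for pod in v1.list_namespaced_pod(namespace="airflow").items:
    print(pod.metadata.name, pod.status.phase)
{code}

A call like this succeeding, while the watcher still dies when _request_timeout is set, supports the "not connectivity per se" conclusion.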
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930224#comment-16930224 ]

Daniel Imberman commented on AIRFLOW-5447:
------------------------------------------

[~Yuval.Itzchakov] [~cwegrzyn] Thank you both for getting this info to us. I THINK this might have to do with a bug in the Kubernetes Python client, which requires "create" and "get" privileges for "pods/exec":

https://stackoverflow.com/questions/53827345/airflow-k8s-operator-xcom-handshake-status-403-forbidden
https://github.com/kubernetes-client/python/issues/690

The reason I believe this is that the lack of running/updating of pods points to a failure of the KubernetesJobWatcher. When we finally started seeing similar problems, we were seeing these failures from the JobWatcher: https://user-images.githubusercontent.com/1036482/64914385-2f0eca80-d71e-11e9-8f8b-44a1c8620b92.png

I'm going to look into this further tomorrow and get back ASAP.
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929517#comment-16929517 ]

Chris Wegrzyn commented on AIRFLOW-5447:
----------------------------------------

I just upgraded a deployment from 1.10.2 to 1.10.5 (we also had the same issues on 1.10.4; I haven't yet tried 1.10.3) and am experiencing the same issues. I've tracked it down to the same line. Judging by the log messages, we get this log message:

https://github.com/apache/airflow/blob/1.10.5/airflow/contrib/executors/kubernetes_executor.py#L762

But we never get this log message:

https://github.com/apache/airflow/blob/1.10.5/airflow/executors/base_executor.py#L135

Here's a slightly sanitized log:

{noformat}
[2019-09-13 19:51:24,074] {scheduler_job.py:1438} DEBUG - Heartbeating the executor
[2019-09-13 19:51:24,074] {base_executor.py:124} DEBUG - 0 running task instances
[2019-09-13 19:51:24,074] {base_executor.py:125} DEBUG - 0 in queue
[2019-09-13 19:51:24,074] {base_executor.py:126} DEBUG - 32 open slots
[2019-09-13 19:51:24,075] {base_executor.py:135} DEBUG - Calling the sync method
[2019-09-13 19:51:24,083] {scheduler_job.py:1459} DEBUG - Ran scheduling loop in 0.01 seconds
[2019-09-13 19:51:24,083] {scheduler_job.py:1462} DEBUG - Sleeping for 1.00 seconds
[2019-09-13 19:51:24,087] {settings.py:54} INFO - Configured default timezone
[2019-09-13 19:51:24,093] {settings.py:327} DEBUG - Failed to import airflow_local_settings.
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/settings.py", line 315, in import_local_settings
    import airflow_local_settings
ModuleNotFoundError: No module named 'airflow_local_settings'
[2019-09-13 19:51:24,094] {logging_config.py:59} DEBUG - Unable to load custom logging, using default config instead
[2019-09-13 19:51:24,109] {settings.py:170} DEBUG - Setting up DB connection pool (PID 49)
[2019-09-13 19:51:24,110] {settings.py:213} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=49
[2019-09-13 19:51:24,295] {settings.py:238} DEBUG - Disposing DB connection pool (PID 55)
[2019-09-13 19:51:24,380] {settings.py:238} DEBUG - Disposing DB connection pool (PID 59)
[2019-09-13 19:51:25,084] {scheduler_job.py:1474} DEBUG - Sleeping for 0.99 seconds to prevent excessive logging
[2019-09-13 19:51:25,117] {scheduler_job.py:257} DEBUG - Waiting for
[2019-09-13 19:51:25,118] {scheduler_job.py:257} DEBUG - Waiting for
[2019-09-13 19:51:25,226] {settings.py:238} DEBUG - Disposing DB connection pool (PID 69)
[2019-09-13 19:51:25,278] {settings.py:238} DEBUG - Disposing DB connection pool (PID 73)
[2019-09-13 19:51:26,076] {scheduler_job.py:1390} DEBUG - Starting Loop...
[2019-09-13 19:51:26,076] {scheduler_job.py:1401} DEBUG - Harvesting DAG parsing results
[2019-09-13 19:51:26,076] {dag_processing.py:637} DEBUG - Received message of type DagParsingStat
[2019-09-13 19:51:26,077] {dag_processing.py:637} DEBUG - Received message of type SimpleDag
[2019-09-13 19:51:26,077] {dag_processing.py:637} DEBUG - Received message of type DagParsingStat
[2019-09-13 19:51:26,078] {dag_processing.py:637} DEBUG - Received message of type DagParsingStat
[2019-09-13 19:51:26,078] {scheduler_job.py:1403} DEBUG - Harvested 1 SimpleDAGs
[2019-09-13 19:51:26,109] {scheduler_job.py:921} INFO - 1 tasks up for execution:
[2019-09-13 19:51:26,122] {scheduler_job.py:953} INFO - Figuring out tasks to run in Pool(name=default_pool) with 128 open slots and 1 task instances ready to be queued
[2019-09-13 19:51:26,123] {scheduler_job.py:981} INFO - DAG parse_log has 0/16 running and queued tasks
[2019-09-13 19:51:26,132] {scheduler_job.py:257} DEBUG - Waiting for
[2019-09-13 19:51:26,133] {scheduler_job.py:257} DEBUG - Waiting for
[2019-09-13 19:51:26,142] {scheduler_job.py:1031} INFO - Setting the following tasks to queued state:
[2019-09-13 19:51:26,157] {scheduler_job.py:1107} INFO - Setting the following 1 tasks to queued state:
[2019-09-13 19:51:26,157] {scheduler_job.py:1143} INFO - Sending ('parse_log', 'xyz_parse_log_2019-09-13', datetime.datetime(2019, 9, 12, 0, 0, tzinfo=), 1) to executor with priority 2 and queue default
[2019-09-13 19:51:26,158] {base_executor.py:59} INFO - Adding to queue: ['airflow', 'run', 'parse_log', 'xyz_parse_log_2019-09-13', '2019-09-12T00:00:00+00:00', '--local', '--pool', 'default_pool', '-sd', '/usr/local/airflow/dags/xyz.py']
[2019-09-13 19:51:26,158] {scheduler_job.py:1438} DEBUG - Heartbeating the executor
[2019-09-13 19:51:26,159] {base_executor.py:124} DEBUG - 0 runnin
{noformat}
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929200#comment-16929200 ]

Kaxil Naik commented on AIRFLOW-5447:
-------------------------------------

We ([~dimberman] and other colleagues at Astronomer) tried, but we were not able to replicate this issue. I suspect some issue with the environment or configuration. Where do you run k8s: GKE? EKS? Can you share your DAG and airflow.cfg file? It is hard to tell, but we can try.
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929191#comment-16929191 ]

Henry Cohen commented on AIRFLOW-5447:
--------------------------------------

If it helps, my pod running the webserver and scheduler is on a node with 5 CPUs and 6 GB of memory.
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929021#comment-16929021 ]

Yuval Itzchakov commented on AIRFLOW-5447:
------------------------------------------

py3
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928907#comment-16928907 ]

Henry Cohen commented on AIRFLOW-5447:
--------------------------------------

[~dimberman] py3
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928901#comment-16928901 ]

Daniel Imberman commented on AIRFLOW-5447:
------------------------------------------

[~HPCohen] [~Yuval.Itzchakov] also question: Are you running py2 or py3?
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928804#comment-16928804 ]

Henry Cohen commented on AIRFLOW-5447:
--------------------------------------

This line in particular is what led my investigation to https://github.com/apache/airflow/blob/v1-10-stable/airflow/contrib/executors/kubernetes_executor.py#L761:

{noformat}
[2019-09-12 17:56:05,186] kubernetes_executor.py:764 INFO - Add task ('example_subdag_operator', 'start', datetime.datetime(2019, 9, 10, 0, 0, tzinfo=), 1) with command ['airflow', 'run', 'example_subdag_operator', 'start', '2019-09-10T00:00:00+00:00', '--local', '--pool', 'default_pool', '-sd', '/usr/local/lib/python3.7/site-packages/airflow/example_dags/example_subdag_operator.py'] with executor_config {}
{noformat}
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928788#comment-16928788 ]

Henry Cohen commented on AIRFLOW-5447:
--------------------------------------

{noformat}
[2019-09-12 17:56:03,034] {kubernetes_executor.py:698} INFO - TaskInstance: found in queued state but was not launched, rescheduling
[2019-09-12 17:56:03,043] {scheduler_job.py:1376} INFO - Resetting orphaned tasks for active dag runs
[2019-09-12 17:56:03,085] {base_job.py:308} INFO - Reset the following 30 TaskInstances:
[2019-09-12 17:56:03,092] {dag_processing.py:545} INFO - Launched DagFileProcessorManager with pid: 35
[2019-09-12 17:56:03,093] {scheduler_job.py:1390} DEBUG - Starting Loop...
[2019-09-12 17:56:03,093] {scheduler_job.py:1401} DEBUG - Harvesting DAG parsing results
[2019-09-12 17:56:03,093] {scheduler_job.py:1403} DEBUG - Harvested 0 SimpleDAGs
[2019-09-12 17:56:03,093] {scheduler_job.py:1438} DEBUG - Heartbeating the executor
[2019-09-12 17:56:03,093] {base_executor.py:124} DEBUG - 0 running task instances
[2019-09-12 17:56:03,094] {base_executor.py:125} DEBUG - 0 in queue
[2019-09-12 17:56:03,094] {base_executor.py:126} DEBUG - 96 open slots
[2019-09-12 17:56:03,094] {base_executor.py:135} DEBUG - Calling the sync method
[2019-09-12 17:56:03,100] {scheduler_job.py:1459} DEBUG - Ran scheduling loop in 0.01 seconds
[2019-09-12 17:56:03,101] {scheduler_job.py:1462} DEBUG - Sleeping for 1.00 seconds
[2019-09-12 17:56:03,107] {settings.py:54} INFO - Configured default timezone
[2019-09-12 17:56:03,109] {settings.py:327} DEBUG - Failed to import airflow_local_settings.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/settings.py", line 315, in import_local_settings
    import airflow_local_settings
ModuleNotFoundError: No module named 'airflow_local_settings'
[2019-09-12 17:56:03,111] {logging_config.py:47} INFO - Successfully imported user-defined logging config from log_config.LOGGING_CONFIG
[2019-09-12 17:56:03,120] {settings.py:170} DEBUG - Setting up DB connection pool (PID 35)
[2019-09-12 17:56:03,121] {settings.py:213} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=35
[2019-09-12 17:56:03,289] {settings.py:238} DEBUG - Disposing DB connection pool (PID 45)
[2019-09-12 17:56:03,356] {settings.py:238} DEBUG - Disposing DB connection pool (PID 41)
[2019-09-12 17:56:04,101] {scheduler_job.py:1474} DEBUG - Sleeping for 0.99 seconds to prevent excessive logging
[2019-09-12 17:56:04,126] {scheduler_job.py:257} DEBUG - Waiting for
[2019-09-12 17:56:04,127] {scheduler_job.py:257} DEBUG - Waiting for
[2019-09-12 17:56:04,162] {settings.py:238} DEBUG - Disposing DB connection pool (PID 55)
[2019-09-12 17:56:04,223] {settings.py:238} DEBUG - Disposing DB connection pool (PID 58)
[2019-09-12 17:56:05,095] {scheduler_job.py:1390} DEBUG - Starting Loop...
[2019-09-12 17:56:05,095] {scheduler_job.py:1401} DEBUG - Harvesting DAG parsing results
[2019-09-12 17:56:05,097] {dag_processing.py:637} DEBUG - Received message of type DagParsingStat
[2019-09-12 17:56:05,098] {dag_processing.py:637} DEBUG - Received message of type SimpleDag
[2019-09-12 17:56:05,098] {dag_processing.py:637} DEBUG - Received message of type SimpleDag
[2019-09-12 17:56:05,099] {dag_processing.py:637} DEBUG - Received message of type SimpleDag
[2019-09-12 17:56:05,099] {dag_processing.py:637} DEBUG - Received message of type SimpleDag
[2019-09-12 17:56:05,100] {dag_processing.py:637} DEBUG - Received message of type DagParsingStat
[2019-09-12 17:56:05,101] {dag_processing.py:637} DEBUG - Received message of type DagParsingStat
[2019-09-12 17:56:05,101] {scheduler_job.py:1403} DEBUG - Harvested 4 SimpleDAGs
[2019-09-12 17:56:05,128] {scheduler_job.py:921} INFO - 5 tasks up for execution:
[2019-09-12 17:56:05,138] {scheduler_job.py:953} INFO - Figuring out tasks to run in Pool(name=default_pool) with 128 open slots and 5 task instances ready to be queued
[2019-09-12 17:56:05,139] {scheduler_job.py:981} INFO - DAG example_subdag_operator has 0/48 running and queued tasks
[2019-09-12 17:56:05,139] {scheduler_job.py:981} INFO - DAG latest_only_with_trigger has 0/48 running and queued tasks
[2019-09-12 17:56:05,139] {scheduler_job.py:981} INFO - DAG latest_only_with_trigger has 1/48 running and queued tasks
[2019-09-12 17:56:05,139] {scheduler_job.py:981} INFO - DAG latest_only_with_trigger has 2/48 running and queued tasks
[2019-09-12 17:56:05,139] {scheduler_job.py:981} INFO - DAG latest_only_with_trigger has 3/48 running and queued tasks
[2019-09-12 17:56:05,139] {scheduler_job.py:257} DEBUG - Waiting for
[2019-09-12 17:56:05,140] {sched
{noformat}
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928757#comment-16928757 ]

Kaxil Naik commented on AIRFLOW-5447:
-------------------------------------

Yes, please add a stack trace, and we will see at our end if we can replicate it.
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928741#comment-16928741 ]

Daniel Imberman commented on AIRFLOW-5447:
------------------------------------------

[~Yuval.Itzchakov] [~HPCohen] could you please post logs and/or a breaking DAG? I'm going to look into this.
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928738#comment-16928738 ]

Daniel Imberman commented on AIRFLOW-5447:
------------------------------------------

Oof, thank you for bringing this to my attention. cc: [~kaxilnaik] [~ash]
[jira] [Commented] (AIRFLOW-5447) KubernetesExecutor hangs on task queueing
[ https://issues.apache.org/jira/browse/AIRFLOW-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927961#comment-16927961 ]

Yuval Itzchakov commented on AIRFLOW-5447:
------------------------------------------

Can confirm I've experienced the exact same behavior. Rolling back to 1.10.3 works.