Overbryd opened a new issue #13542:
URL: https://github.com/apache/airflow/issues/13542


   **Apache Airflow version**: `2.0.0`
   
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
   
   ```
   Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
   Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.14-gke.1600", GitCommit:"7c407f5cc8632f9af5a2657f220963aa7f1c46e7", GitTreeState:"clean", BuildDate:"2020-12-07T09:22:27Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}
   ```
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: GKE
   - **OS** (e.g. from /etc/os-release):
   - **Kernel** (e.g. `uname -a`):
   - **Install tools**:
   - **Others**:
     - Airflow metadata database is hooked up to a PostgreSQL instance
   
   **What happened**:
   
   * Airflow 2.0.0 running on the `KubernetesExecutor` has many tasks stuck in "scheduled" or "queued" state that never get resolved.
   * The setup has a `default_pool` of 16 slots.
   * Currently no slots are in use (see screenshot), yet all 16 slots are taken up by queued tasks.
   * No work is executed any more; the executor or scheduler is stuck.
   * Very many tasks are stuck in "scheduled" state.
     * Tasks in "scheduled" state log `('Not scheduling since there are %s open slots in pool %s and require %s pool slots', 0, 'default_pool', 1)`
       That is simply not true: nothing is running on the cluster, and there are always 16 tasks stuck in "queued".
   * Many tasks are stuck in "queued" state.
     * Tasks in "queued" state log `Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.`
       That is also not true. Nothing is running on the cluster; Airflow is likely just lying to itself. The KubernetesExecutor and the scheduler seem to go out of sync easily. A sketch for inspecting the actual pool state in the metadata database follows this list.
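
   To cross-check what the scheduler believes against the metadata database, here is a minimal inspection sketch. It assumes it runs somewhere the Airflow 2.0 configuration (and thus the metadata DB connection) is available; `default_pool` matches our setup:

   ```
   # Count task instances per state in default_pool, to compare against the
   # "0 open slots" the scheduler logs.
   from sqlalchemy import func

   from airflow.models import TaskInstance
   from airflow.utils.session import create_session

   with create_session() as session:
       counts = (
           session.query(TaskInstance.state, func.count(TaskInstance.task_id))
           .filter(TaskInstance.pool == "default_pool")
           .group_by(TaskInstance.state)
           .all()
       )
       for state, count in counts:
           print(f"{state}: {count}")
   ```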
   
   **What you expected to happen**:
   
   * Airflow should resolve scheduled or queued tasks by itself once the pool has available slots
   * Airflow should use all available slots in the pool
   * It should be possible to clear a couple hundred tasks and have the system stay consistent
   
   **How to reproduce it**:
   
   * Vanilla Airflow 2.0.0 with `KubernetesExecutor` on Python `3.7.9`
   * `requirements.txt`
   
     ```
     pyodbc==4.0.30
     pycryptodomex==3.9.9
     apache-airflow-providers-google==1.0.0
     apache-airflow-providers-odbc==1.0.0
     apache-airflow-providers-postgres==1.0.0
     apache-airflow-providers-cncf-kubernetes==1.0.0
     apache-airflow-providers-sftp==1.0.0
     apache-airflow-providers-ssh==1.0.0
     ```
   
   * The only reliable way to trigger this bug is to clear the task state of many tasks at once (> 300 tasks); see the sketch after this list.
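
   For reference, a sketch of the kind of bulk clear that triggers it (whether the clear happens through the UI or programmatically should not matter; `"my_dag"` and the date range are placeholders for an affected DAG):

   ```
   # Hypothetical reproduction sketch: clear several hundred task instances
   # of one affected DAG at once.
   from datetime import datetime

   from airflow.models import DagBag

   dag = DagBag().get_dag("my_dag")
   dag.clear(start_date=datetime(2020, 12, 1), end_date=datetime(2021, 1, 1))
   ```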
   
   **Anything else we need to know**:
   
   Not much; as always, I am happy to help debug this problem.
   The scheduler/executor seems to go out of sync with the state of the world and never gets back in sync. A possible manual recovery is sketched below.
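
   A workaround sketch, not an official fix: push the task instances stuck in "queued" back to scheduling by clearing their state, after verifying with `kubectl get pods` that no pods actually exist for them:

   ```
   # Reset task instances stuck in "queued" so the scheduler picks them up
   # again. create_session() commits when the block exits.
   from airflow.models import TaskInstance
   from airflow.utils.session import create_session
   from airflow.utils.state import State

   with create_session() as session:
       stuck = session.query(TaskInstance).filter(
           TaskInstance.state == State.QUEUED
       ).all()
       for ti in stuck:
           ti.state = None  # no state, so it is eligible for scheduling again
   ```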
   
   We actually planned to scale up our Airflow installation to run many more simultaneous tasks. With these severe yet basic scheduling/queuing problems we cannot move forward at all.
   
   Another strange, likely unrelated observation: the scheduler always uses 100% of the CPU, burning it. Even with no scheduled or queued tasks, it is always very busy.

