throwawayaccount11111 opened a new issue, #32928:
URL: https://github.com/apache/airflow/issues/32928

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### What happened
   
   We are running Airflow 2.5.3 on EKS. Airflow has been experiencing progressive slowness over a period of 2-3 weeks: DAGs start getting queued without ever executing, which forces us to restart the scheduler. After the restart, the problem goes away for a few days and then slowly creeps back up.
   
   The pods, the logs, and the dashboards look healthy, and the UI shows that no tasks are currently running. One dashboard shows metrics for `Executor running tasks` and `Executor open slots`, and we noticed it accurately reflected the slowness: over time the number of open slots would decrease while the number of running tasks would increase, and the two would never reset even during long periods when nothing was running. These metrics come from `base_executor`:
   
   ```python
          Stats.gauge("executor.open_slots", open_slots)
           Stats.gauge("executor.queued_tasks", num_queued_tasks)
           Stats.gauge("executor.running_tasks", num_running_tasks)
   ```
   
   and `num_running_tasks` is defined as `num_running_tasks = 
len(self.running)` in `base_executor`.
   
   <img width="860" alt="Screenshot 2023-07-28 at 3 11 30 PM" 
src="https://github.com/apache/airflow/assets/108699169/4f8bfade-36b2-41c5-9e33-1fc0c45e9e36";>
   
   
   At this point we enabled some debug logs from `KubernetesExecutor` under this [method](https://github.com/apache/airflow/blob/2.5.3/airflow/executors/kubernetes_executor.py#L610C17-L610C17) to see what was in `self.running`:
   
   ```python
   def sync(self) -> None:
       """Synchronize task state."""
       if self.running:
           self.log.debug("self.running: %s", self.running)  # <-- this log
       # ...
       self.kube_scheduler.sync()
   ```
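
   (This debug line is in the stock 2.5.3 source; we surfaced it by raising the scheduler's log level, e.g. `AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG`, assuming the standard logging config.)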
   
   In that method, `self.running` is defined as `self.running: set[TaskInstanceKey] = set()`. The log showed that tasks which had already completed successfully were somehow still present in `self.running`. For example, a snippet of the log output on the 28th shows it still holding tasks that completed successfully on the 24th and 27th:
   
   ```
   time: Jul 28, 2023 @ 15:07:01.784
   self.running: {TaskInstanceKey(dag_id='flight_history.py', task_id='load_file', run_id='manual__2023-07-24T01:06:18+00:00', try_number=1, map_index=17), TaskInstanceKey(dag_id='emd_load.py', task_id='processing.emd', run_id='scheduled__2023-07-25T07:30:00+00:00', try_number=1, map_index=-1), ...
   ```
   
   We validated from both the UI and Postgres that these tasks had completed without any issue.
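
   For anyone who wants to repeat that cross-check, here is a rough sketch of it done programmatically, assuming access to the metadata DB through Airflow's own session helpers (`find_stale_keys` is an illustrative name, not an Airflow API; our actual validation was a manual Postgres query and the UI):

   ```python
   # Sketch: find keys the executor still tracks although the metadata DB
   # says the corresponding task instances already finished.
   from airflow.models.taskinstance import TaskInstance
   from airflow.utils.session import create_session
   from airflow.utils.state import State

   def find_stale_keys(running_keys):
       """Return keys whose task instances are in a finished state in the DB."""
       stale = []
       with create_session() as session:
           for key in running_keys:
               ti = (
                   session.query(TaskInstance)
                   .filter(
                       TaskInstance.dag_id == key.dag_id,
                       TaskInstance.task_id == key.task_id,
                       TaskInstance.run_id == key.run_id,
                       TaskInstance.map_index == key.map_index,
                   )
                   .one_or_none()
               )
               if ti is not None and ti.state in State.finished:
                   stale.append(key)
       return stale
   ```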
   
   Once the scheduler pod is restarted, the problem goes away: the metrics in the Grafana dashboard reset and tasks start executing again.
   
   ### What you think should happen instead
   
   Airflow's scheduler keeps track of currently running tasks in memory, and in some cases that state is not getting cleared. Tasks that have completed should eventually be removed from the `running` set in `KubernetesExecutor` once their pods exit.
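
   From reading the 2.5.3 source, the only place we see completed keys leaving `self.running` is the executor's `_change_state` path; roughly paraphrased (not verbatim, details elided):

   ```python
   # Roughly paraphrased from KubernetesExecutor._change_state in 2.5.x:
   # a key leaves self.running only when a terminal pod event is processed,
   # so a lost or missed event would strand the key there indefinitely.
   def _change_state(self, key, state, pod_id, namespace):
       if state != State.RUNNING:
           # ... worker pod deletion / patching elided ...
           try:
               self.running.remove(key)  # the cleanup that never happens for stuck keys
           except KeyError:
               self.log.debug("Could not find key: %s", str(key))
       self.event_buffer[key] = state, None
   ```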
   
   ### How to reproduce
   
   Beats me. Our initial assumption was that this was a DAG implementation issue and some particular DAG was misbehaving, but the problem has occurred with all sorts of DAGs, happens for both scheduled and manual runs, and is sporadic.
   
   ### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   ### Versions of Apache Airflow Providers
   
   ```
   # This file contains the dependencies required by DAGs and is used to build the Airflow docker image.

   # Alphabetical

   aiofiles==23.1.0
   aiohttp==3.8.4
   airflow-dbt>=0.4.0
   airflow-exporter==1.5.3
   anytree==2.8.0
   apache-airflow>=2.2.3
   apache-airflow-providers-cncf-kubernetes==5.2.2
   apache-airflow-providers-ftp==2.0.1
   apache-airflow-providers-hashicorp==3.3.0
   apache-airflow-providers-http>=2.0.3
   apache-airflow-providers-microsoft-mssql==2.1.3
   apache-airflow-providers-snowflake>=4.0.4
   asgiref==3.5.0
   Authlib==0.15.5
   dbt-snowflake==1.5.2
   flatdict==4.0.1
   hvac==0.11.2
   jsonschema>=4.17.3
   pandas==1.3.5
   psycopg2-binary==2.9.3
   py7zr==0.20.5
   pyOpenSSL==23.1.1
   pysftp==0.2.9
   pysmbclient==0.1.5
   python-gnupg==0.5.0
   PyYAML~=5.4.1
   requests~=2.26.0
   smbprotocol==1.9.0
   snowflake-connector-python==3.0.4
   snowflake-sqlalchemy==1.4.7
   statsd==3.3.0
   ```
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Airflow is deployed via helm charts on EKS in AWS.
   
   ### Anything else
   
   Sporadic, with no consistent pattern.
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

