hbc-acai opened a new issue, #40106:
URL: https://github.com/apache/airflow/issues/40106

   ### Apache Airflow version
   
   2.9.1
   
   ### If "Other Airflow 2 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   After upgrading to 2.9.1, tasks get stuck in the scheduled state about one 
hour after the scheduler starts. During the first hour, all tasks run 
fine. When I restarted the scheduler, it successfully moved the "stuck" 
task instances to the queued state and ran them, but new tasks got stuck 
again after about an hour. 
   
   This is reproducible in my production cluster: it happens every time 
after we restart our scheduler. However, we are not able to replicate it in 
our dev cluster. 
   
   There are no errors in the scheduler log. Here are some logs from when 
things went wrong. I manually cleared one DAG with 3 tasks. 2 of the 3 tasks 
ran successfully, but the third task got stuck in the scheduled state. In the 
log below I only found information about the 2 tasks (database_setup, 
positions_extract) that ran successfully. 
   
   
   ```
   [2024-06-07T01:12:52.113+0000] {kubernetes_executor.py:240} INFO - Found 0 
queued task instances
   [2024-06-07T01:13:52.199+0000] {kubernetes_executor.py:240} INFO - Found 0 
queued task instances
   [2024-06-07T01:14:52.284+0000] {kubernetes_executor.py:240} INFO - Found 0 
queued task instances
   [2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:417} INFO - 2 tasks 
up for execution:
           <TaskInstance: update_risk_exposure_store.database_setup 
scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
           <TaskInstance: update_risk_exposure_store.positions_extract 
scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
   [2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:480} INFO - DAG 
update_risk_exposure_store has 0/16 running and queued tasks
   [2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:480} INFO - DAG 
update_risk_exposure_store has 1/16 running and queued tasks
   [2024-06-07T01:15:37.977+0000] {scheduler_job_runner.py:596} INFO - Setting 
the following tasks to queued state:
           <TaskInstance: update_risk_exposure_store.database_setup 
scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
           <TaskInstance: update_risk_exposure_store.positions_extract 
scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
   [2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending 
TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='database_setup', 
run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to 
executor with priority 25 and queue default
   [2024-06-07T01:15:37.980+0000] {base_executor.py:149} INFO - Adding to 
queue: ['airflow', 'tasks', 'run', 'update_risk_exposure_store', 
'database_setup', 'scheduled__2024-05-28T10:10:00+00:00', '--local', 
'--subdir', 'DAGS_FOLDER/update_risk_exposure_store.py']
   [2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending 
TaskInstanceKey(dag_id='update_risk_exposure_store', 
task_id='positions_extract', run_id='scheduled__2024-05-28T10:10:00+00:00', 
try_number=3, map_index=-1) to executor with priority 25 and queue default
   [2024-06-07T01:15:37.981+0000] {base_executor.py:149} INFO - Adding to 
queue: ['airflow', 'tasks', 'run', 'update_risk_exposure_store', 
'positions_extract', 'scheduled__2024-05-28T10:10:00+00:00', '--local', 
'--subdir', 'DAGS_FOLDER/update_risk_exposure_store.py']
   ```
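
To double-check which task instances the scheduler actually handed to the executor, a small script along these lines could be run over the scheduler log. This is only a sketch: it assumes the `Sending TaskInstanceKey(...)` line format shown in the excerpt above, the helper names are made up for illustration, and `third_task` is a hypothetical placeholder because the stuck task's real id does not appear in the log.

```python
import re

# Two "Sending TaskInstanceKey" lines copied from the log excerpt above,
# with the mail-wrapped lines rejoined.
LOG = """\
[2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='database_setup', run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to executor with priority 25 and queue default
[2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='positions_extract', run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to executor with priority 25 and queue default
"""

# \s* / \s+ also tolerate line breaks, so wrapped log text still matches.
KEY_RE = re.compile(
    r"Sending\s+TaskInstanceKey\(dag_id='([^']+)',\s*task_id='([^']+)'"
)


def tasks_sent_to_executor(log_text):
    """Return the set of (dag_id, task_id) pairs found in 'Sending ...' lines."""
    return {(m.group(1), m.group(2)) for m in KEY_RE.finditer(log_text)}


def missing_tasks(log_text, dag_id, expected_task_ids):
    """Return the expected task ids that were never sent to the executor."""
    sent = {task for dag, task in tasks_sent_to_executor(log_text) if dag == dag_id}
    return sorted(set(expected_task_ids) - sent)


if __name__ == "__main__":
    # "third_task" is a hypothetical stand-in for the stuck task's real id.
    expected = ["database_setup", "positions_extract", "third_task"]
    print(missing_tasks(LOG, "update_risk_exposure_store", expected))
```

Any task id reported as missing was never passed to the executor within the scanned log window, which matches the symptom of a task instance staying in the scheduled state.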
   
   ### What you think should happen instead?
   
   _No response_
   
   ### How to reproduce
   
   I can easily reproduce it in my production cluster. But I cannot reproduce 
it in our dev cluster. Both clusters have almost exactly the same setup. 
   
   ### Operating System
   
   Azure Kubernetes Service 
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Using the apache-airflow:2.9.1-python3.10 image
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
