hbc-acai opened a new issue, #40106:
URL: https://github.com/apache/airflow/issues/40106
### Apache Airflow version
2.9.1
### If "Other Airflow 2 version" selected, which one?
_No response_
### What happened?
After upgrading to 2.9.1, we found that tasks get stuck in the scheduled state
about 1 hour after the scheduler starts. During the first hour, all tasks run
fine. When I restarted the scheduler, it successfully moved the "stuck"
task instances to the queued state and then ran them. But new tasks got stuck
again after one hour.
This is reproducible in my production cluster. It happens every time after
we restart our scheduler. But we are not able to replicate this in our dev
cluster.
There are no errors in the scheduler log. Here are some logs from when things
went wrong. I manually cleared one DAG with 3 tasks. 2 of the 3 tasks ran
successfully, but the third task got stuck in the scheduled state. In the log
below I only found information about the 2 tasks (database_setup,
positions_extract) that ran successfully.
```
[2024-06-07T01:12:52.113+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:13:52.199+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:14:52.284+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:417} INFO - 2 tasks up for execution:
<TaskInstance: update_risk_exposure_store.database_setup scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
<TaskInstance: update_risk_exposure_store.positions_extract scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:480} INFO - DAG update_risk_exposure_store has 0/16 running and queued tasks
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:480} INFO - DAG update_risk_exposure_store has 1/16 running and queued tasks
[2024-06-07T01:15:37.977+0000] {scheduler_job_runner.py:596} INFO - Setting the following tasks to queued state:
<TaskInstance: update_risk_exposure_store.database_setup scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
<TaskInstance: update_risk_exposure_store.positions_extract scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
[2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='database_setup', run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to executor with priority 25 and queue default
[2024-06-07T01:15:37.980+0000] {base_executor.py:149} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'update_risk_exposure_store', 'database_setup', 'scheduled__2024-05-28T10:10:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/update_risk_exposure_store.py']
[2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='positions_extract', run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to executor with priority 25 and queue default
[2024-06-07T01:15:37.981+0000] {base_executor.py:149} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'update_risk_exposure_store', 'positions_extract', 'scheduled__2024-05-28T10:10:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/update_risk_exposure_store.py']
```
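To check which task instances the scheduler actually handed to the executor, the `Sending TaskInstanceKey(...)` records in a log excerpt like the one above can be extracted and compared against the task ids expected for the run. This is only a rough sketch for triage: the helper names `tasks_sent_to_executor` and `missing_tasks` are made up for illustration and are not part of Airflow, and the regex assumes the log format shown above.

```python
import re

# Matches the TaskInstanceKey(...) text from "Sending TaskInstanceKey(...) to executor"
# scheduler log records. \s* tolerates records that were wrapped across lines.
KEY_RE = re.compile(
    r"TaskInstanceKey\(\s*dag_id='(?P<dag_id>[^']+)',\s*"
    r"task_id='(?P<task_id>[^']+)',\s*"
    r"run_id='(?P<run_id>[^']+)'"
)


def tasks_sent_to_executor(log_text):
    """Return the set of (dag_id, task_id, run_id) tuples found in the log text."""
    return {(m["dag_id"], m["task_id"], m["run_id"]) for m in KEY_RE.finditer(log_text)}


def missing_tasks(log_text, dag_id, run_id, expected_task_ids):
    """Return the expected task ids that never appear in a TaskInstanceKey record."""
    sent = {
        task_id
        for d, task_id, r in tasks_sent_to_executor(log_text)
        if d == dag_id and r == run_id
    }
    return set(expected_task_ids) - sent
```

Calling `missing_tasks(log_text, "update_risk_exposure_store", "scheduled__2024-05-28T10:10:00+00:00", expected_task_ids)` with the DAG's full list of task ids would return the ids that were never sent to the executor in that excerpt.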
### What you think should happen instead?
_No response_
### How to reproduce
I can easily reproduce it in my production cluster. But I cannot reproduce
it in our dev cluster. Both clusters have almost exactly the same setup.
### Operating System
Azure Kubernetes Service
### Versions of Apache Airflow Providers
_No response_
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
Using apache-airflow:2.9.1-python3.10 image
### Anything else?
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)