katguev opened a new issue #14556:
URL: https://github.com/apache/airflow/issues/14556
**Apache Airflow version**: `2.0.1`
**Kubernetes version (if you are using kubernetes)** (use `kubectl
version`):
`Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.1",
GitCommit:"7879fc12a63337efff607952a323df90cdc7a335", GitTreeState:"clean",
BuildDate:"2020-04-10T21:53:58Z", GoVersion:"go1.14.2", Compiler:"gc",
Platform:"darwin/amd64"}`
`Server Version: version.Info{Major:"1", Minor:"16+",
GitVersion:"v1.16.15-gke.7800",
GitCommit:"cef3156c566a1d1a4b23ee360a760f45bfbaaac1", GitTreeState:"clean",
BuildDate:"2020-12-14T09:12:37Z", GoVersion:"go1.13.15b4", Compiler:"gc",
Platform:"linux/amd64"}`
**Environment**:
- **Cloud provider or hardware configuration**: `GKE`
- **OS** (e.g. from /etc/os-release):
- **Kernel** (e.g. `uname -a`):
- **Install tools**:
- **Others**: We use scheduler HA with 2 instances.
**What happened**:
- We are running Airflow 2.0.1 using KubernetesExecutor and PostgreSQL 9.6.2.
- Task is `up_for_retry` after its worker pod fails to start
- It gets stuck in `queued` state
- It runs only after a scheduler restart.
**What you expected to happen**:
- Task gets rescheduled and runs successfully.
**How to reproduce it**:
**Anything else we need to know**:
Logs below.
<details>
Its upstream task succeeds and schedules it via fast follow.
`[2021-03-01 17:08:51,306] {taskinstance.py:1166} INFO - Marking task as
SUCCESS. dag_id=datalake_dag_id, task_id=processidlogs,
execution_date=20210301T163000, start_date=20210301T170546,
end_date=20210301T170851`
`[2021-03-01 17:08:51,339] {taskinstance.py:1220} INFO - 1 downstream tasks
scheduled from follow-on schedule check`
`[2021-03-01 17:08:51,357] {local_task_job.py:146} INFO - Task exited with
return code 0`
Task is attempted to be run.
`[2021-03-01 17:08:52,229] {scheduler_job.py:1105} INFO - Sending
TaskInstanceKey(dag_id='datalake_dag_id', task_id='delta_id_logs',
execution_date=datetime.datetime(2021, 3, 1, 16, 30, tzinfo=Timezone('UTC')),
try_number=1) to executor with priority 1 and queue default`
`[2021-03-01 17:08:52,308] {kubernetes_executor.py:306} DEBUG - Kubernetes
running for command ['airflow', 'tasks', 'run', 'datalake_dag_id',
'delta_id_logs', '2021-03-01T16:30:00+00:00', '--local', '--pool',
'default_pool', '--subdir',
'/opt/airflow/dags/data_lake/some_tasks/some_tasks.py']`
`[2021-03-01 17:08:52,332] {scheduler_job.py:1206} INFO - Executor reports
execution of datalake_dag_id.delta_id_logs execution_date=2021-03-01
16:30:00+00:00 exited with status queued for try_number 1`
Pod fails to start.
`[2021-03-01 17:12:17,319] {kubernetes_executor.py:197} INFO - Event: Failed
to start pod datalakedagiddeltaidlogs.5fa98ae3856f4cb4b6c8810ac13e5c6a, will
reschedule`
It is put as up_for_reschedule.
`[2021-03-01 17:12:23,912] {kubernetes_executor.py:343} DEBUG - Processing
task ('datalakedagiddeltaidlogs.5fa98ae3856f4cb4b6c8810ac13e5c6a', 'prod',
'up_for_reschedule', {'dag_id': 'datalake_dag_id', 'task_id': 'delta_id_logs',
'execution_date': '2021-03-01T16:30:00+00:00', 'try_number': '1'},
'1172208829')`
`[2021-03-01 17:12:23,930] {kubernetes_executor.py:528} INFO - Changing
state of (TaskInstanceKey(dag_id='datalake_dag_id', task_id='delta_id_logs',
execution_date=datetime.datetime(2021, 3, 1, 16, 30, tzinfo=tzlocal()),
try_number=1), 'up_for_reschedule',
'datalakedagiddeltaidlogs.5fa98ae3856f4cb4b6c8810ac13e5c6a', 'prod',
'1172208829') to up_for_reschedule`
`[2021-03-01 17:12:23,941] {scheduler_job.py:1206} INFO - Executor reports
execution of datalake_dag_id.delta_id_logs execution_date=2021-03-01
16:30:00+00:00 exited with status up_for_reschedule for try_number 1`
A few minutes later, another scheduler finds it in queued state.
`[2021-03-01 17:15:39,177] {taskinstance.py:851} DEBUG - Dependencies all
met for <TaskInstance: datalake_dag_id.delta_id_logs 2021-03-01 16:30:00+00:00
[queued]>`
`[2021-03-01 17:15:40,477] {taskinstance.py:866} DEBUG - <TaskInstance:
datalake_dag_id.delta_id_logs 2021-03-01 16:30:00+00:00 [queued]> dependency
'Not In Retry Period' PASSED: True, The context specified that being in a retry
period was permitted.`
`[2021-03-01 17:15:40,478] {taskinstance.py:866} DEBUG - <TaskInstance:
datalake_dag_id.delta_id_logs 2021-03-01 16:30:00+00:00 [queued]> dependency
'Previous Dagrun State' PASSED: True, The task did not have depends_on_past
set.`
It stays in that state for another hour and a half until the scheduler is
restarted.
Finally, it is rescheduled.
`[2021-03-01 18:58:10,475] {kubernetes_executor.py:463} INFO - TaskInstance:
<TaskInstance: datalake_dag_id.delta_id_logs 2021-03-01 16:30:00+00:00
[queued]> found in queued state but was not launched, rescheduling`
</details>
Other tasks run fine while that one is stuck in queued state. We have a cron
job to restart the scheduler as a hack to recover from when such cases happen,
but we would like to avoid it as much as possible.
We run 60 DAGs with 50-100 tasks each every 30 minutes. We have been seeing
this issue at least once daily since we upgraded to Airflow 2.0.1.
I understand there are some open issues about the scheduler or tasks getting
stuck. But I could not tell if this is related, since hundreds of other tasks
run as expected. Apologies if this turns out to be a duplicate of an existing
issue. Thank you.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]