[GitHub] [airflow] katguev opened a new issue #14556: Task stuck in queued state after pod fails to start

GitBox Tue, 02 Mar 2021 01:03:08 -0800


katguev opened a new issue #14556:
URL: https://github.com/apache/airflow/issues/14556



   **Apache Airflow version**: `2.0.1`
   
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl 
version`):  
   `Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.1", 
GitCommit:"7879fc12a63337efff607952a323df90cdc7a335", GitTreeState:"clean", 
BuildDate:"2020-04-10T21:53:58Z", GoVersion:"go1.14.2", Compiler:"gc", 
Platform:"darwin/amd64"}`  
   
   `Server Version: version.Info{Major:"1", Minor:"16+", 
GitVersion:"v1.16.15-gke.7800", 
GitCommit:"cef3156c566a1d1a4b23ee360a760f45bfbaaac1", GitTreeState:"clean", 
BuildDate:"2020-12-14T09:12:37Z", GoVersion:"go1.13.15b4", Compiler:"gc", 
Platform:"linux/amd64"}`
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: `GKE`
   - **OS** (e.g. from /etc/os-release):
   - **Kernel** (e.g. `uname -a`):
   - **Install tools**:
   - **Others**: We use scheduler HA with 2 instances. 
   
   **What happened**:
   - We are running Airflow 2.0.1 using KubernetesExecutor and PostgreSQL 9.6.2.
   - Task is `up_for_retry` after its worker pod fails to start
   - It gets stuck in `queued` state
   - It runs only after a scheduler restart. 
   
   **What you expected to happen**:
   - Task gets rescheduled and runs successfully.
   
   **How to reproduce it**:
   
   
   **Anything else we need to know**:
   
   Logs below. 
   
   <details>
   
   
   Its upstream task succeeds and schedules it via fast follow.  
   `[2021-03-01 17:08:51,306] {taskinstance.py:1166} INFO - Marking task as 
SUCCESS. dag_id=datalake_dag_id, task_id=processidlogs, 
execution_date=20210301T163000, start_date=20210301T170546, 
end_date=20210301T170851`
   
   `[2021-03-01 17:08:51,339] {taskinstance.py:1220} INFO - 1 downstream tasks 
scheduled from follow-on schedule check`
   
   `[2021-03-01 17:08:51,357] {local_task_job.py:146} INFO - Task exited with 
return code 0`
   
   
   Task is attempted to be run.  
   
   `[2021-03-01 17:08:52,229] {scheduler_job.py:1105} INFO - Sending 
TaskInstanceKey(dag_id='datalake_dag_id', task_id='delta_id_logs', 
execution_date=datetime.datetime(2021, 3, 1, 16, 30, tzinfo=Timezone('UTC')), 
try_number=1) to executor with priority 1 and queue default`
   
   `[2021-03-01 17:08:52,308] {kubernetes_executor.py:306} DEBUG - Kubernetes 
running for command ['airflow', 'tasks', 'run', 'datalake_dag_id', 
'delta_id_logs', '2021-03-01T16:30:00+00:00', '--local', '--pool', 
'default_pool', '--subdir', 
'/opt/airflow/dags/data_lake/some_tasks/some_tasks.py']`
   
   `[2021-03-01 17:08:52,332] {scheduler_job.py:1206} INFO - Executor reports 
execution of datalake_dag_id.delta_id_logs execution_date=2021-03-01 
16:30:00+00:00 exited with status queued for try_number 1`
   
   
   
   Pod fails to start.  
   
   `[2021-03-01 17:12:17,319] {kubernetes_executor.py:197} INFO - Event: Failed 
to start pod datalakedagiddeltaidlogs.5fa98ae3856f4cb4b6c8810ac13e5c6a, will 
reschedule`
   
   
   It is put as up_for_reschedule.  
   
   `[2021-03-01 17:12:23,912] {kubernetes_executor.py:343} DEBUG - Processing 
task ('datalakedagiddeltaidlogs.5fa98ae3856f4cb4b6c8810ac13e5c6a', 'prod', 
'up_for_reschedule', {'dag_id': 'datalake_dag_id', 'task_id': 'delta_id_logs', 
'execution_date': '2021-03-01T16:30:00+00:00', 'try_number': '1'}, 
'1172208829')`
   
   
   `[2021-03-01 17:12:23,930] {kubernetes_executor.py:528} INFO - Changing 
state of (TaskInstanceKey(dag_id='datalake_dag_id', task_id='delta_id_logs', 
execution_date=datetime.datetime(2021, 3, 1, 16, 30, tzinfo=tzlocal()), 
try_number=1), 'up_for_reschedule', 
'datalakedagiddeltaidlogs.5fa98ae3856f4cb4b6c8810ac13e5c6a', 'prod', 
'1172208829') to up_for_reschedule`
   
   `[2021-03-01 17:12:23,941] {scheduler_job.py:1206} INFO - Executor reports 
execution of datalake_dag_id.delta_id_logs execution_date=2021-03-01 
16:30:00+00:00 exited with status up_for_reschedule for try_number 1`
   
   
   A few minutes later, another scheduler finds it in queued state.  
   
   `[2021-03-01 17:15:39,177] {taskinstance.py:851} DEBUG - Dependencies all 
met for <TaskInstance: datalake_dag_id.delta_id_logs 2021-03-01 16:30:00+00:00 
[queued]>`
   
   `[2021-03-01 17:15:40,477] {taskinstance.py:866} DEBUG - <TaskInstance: 
datalake_dag_id.delta_id_logs 2021-03-01 16:30:00+00:00 [queued]> dependency 
'Not In Retry Period' PASSED: True, The context specified that being in a retry 
period was permitted.`
   
   `[2021-03-01 17:15:40,478] {taskinstance.py:866} DEBUG - <TaskInstance: 
datalake_dag_id.delta_id_logs 2021-03-01 16:30:00+00:00 [queued]> dependency 
'Previous Dagrun State' PASSED: True, The task did not have depends_on_past 
set.`
   
   
   It stays in that state for another hour and a half until the scheduler is 
restarted.  
   
   Finally, it is rescheduled.  
   `[2021-03-01 18:58:10,475] {kubernetes_executor.py:463} INFO - TaskInstance: 
<TaskInstance: datalake_dag_id.delta_id_logs 2021-03-01 16:30:00+00:00 
[queued]> found in queued state but was not launched, rescheduling`
   
   </details>
   
   Other tasks run fine while that one is stuck in queued state. We have a cron 
job to restart the scheduler as a hack to recover from when such cases happen, 
but we would like to avoid it as much as possible. 
   
   We run 60 DAGs with 50-100 tasks each every 30 minutes. We have been seeing 
this issue at least once daily since we upgraded to Airflow 2.0.1. 
   
   I understand there are some open issues about the scheduler or tasks getting 
stuck. But I could not tell if this is related, since hundreds of other tasks 
run as expected. Apologies if this turns out to be a duplicate of an existing 
issue. Thank you. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [airflow] katguev opened a new issue #14556: Task stuck in queued state after pod fails to start

Reply via email to