GitHub user pvieito added a comment to the discussion: Airflow task stuck on DEFERRED status

Hi @kunaljubce!

- No updates in that time window.
- All our tasks use vanilla Airflow AWS operators, and the DAGs are super simple, generated from a handful of templates (basically `BatchOperator` & `GlueJobOperator`):

```python
import pendulum

from airflow import DAG
from airflow.providers.amazon.aws.operators.batch import BatchOperator

# dag_id, dag_schedule, job_*, etc. are injected by our deployment templates;
# log_batch_job is a template-provided helper task defined alongside the DAG.
with DAG(
    dag_id=dag_id,
    schedule=dag_schedule,
    start_date=pendulum.parse(dag_schedule_start_date),
    tags=dag_deployment_tags,
    catchup=False,
) as dag:
    batch_job = BatchOperator(
        task_id="batch_job",
        job_name=job_name,
        job_definition=job_definition_arn,
        job_queue=job_queue_name,
        tags=job_tags,
        aws_conn_id=job_aws_connection_id,
        container_overrides=dict(
            environment=aws_batch_job_environment_variables,
        ),
    )

    log_batch_job() >> batch_job
```
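
Note that the snippet never passes `deferrable=True` explicitly, so the deferral presumably comes from the environment-wide `[operators] default_deferrable` setting. The explicit per-task form would look roughly like this (a sketch for reference, not our actual template code):

```python
# Sketch only: explicit per-task opt-in, roughly equivalent to setting
# AIRFLOW__OPERATORS__DEFAULT_DEFERRABLE=true for the whole environment.
# With deferrable=True the operator submits the Batch job and then defers,
# leaving the triggerer (not a worker slot) to wait for job completion.
batch_job = BatchOperator(
    task_id="batch_job",
    job_name=job_name,
    job_definition=job_definition_arn,
    job_queue=job_queue_name,
    aws_conn_id=job_aws_connection_id,
    deferrable=True,
)
```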

- All previously deferred tasks started to misbehave indefinitely, and all newly deferred tasks misbehave as well.
- Some extra full logs of tasks stuck in DEFERRED without any waiter logs (a small diagnostic sketch follows after the logs):

- With `GlueJobOperator`:
```
ip-10-5-146-175.eu-west-1.compute.internal
*** Reading remote log from Cloudwatch log_group: airflow-DT-LAB-Airflow-Environment-Task log_stream: dag_id=DT-LAB-Autostop-Schedule-PushProximity-1/run_id=manual__2024-11-04T09_55_51+00_00/task_id=glue_job/attempt=1.log.
[2024-11-04, 09:55:59 UTC] {local_task_job_runner.py:123} ▼ Pre task execution logs
[2024-11-04, 09:55:59 UTC] {taskinstance.py:2612} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: DT-LAB-Autostop-Schedule-PushProximity-1.glue_job manual__2024-11-04T09:55:51+00:00 [queued]>
[2024-11-04, 09:55:59 UTC] {taskinstance.py:2612} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: DT-LAB-Autostop-Schedule-PushProximity-1.glue_job manual__2024-11-04T09:55:51+00:00 [queued]>
[2024-11-04, 09:55:59 UTC] {taskinstance.py:2865} INFO - Starting attempt 1 of 1
[2024-11-04, 09:55:59 UTC] {taskinstance.py:2888} INFO - Executing <Task(GlueJobOperator): glue_job> on 2024-11-04 09:55:51+00:00
[2024-11-04, 09:55:59 UTC] {standard_task_runner.py:72} INFO - Started process 2450 to run task
[2024-11-04, 09:55:59 UTC] {standard_task_runner.py:104} INFO - Running: ['airflow', 'tasks', 'run', 'DT-LAB-Autostop-Schedule-PushProximity-1', 'glue_job', 'manual__2024-11-04T09:55:51+00:00', '--job-id', '39075', '--raw', '--subdir', 'DAGS_FOLDER/DT-LAB-Autostop-Schedule-PushProximity-1.py', '--cfg-path', '/tmp/tmpceiye8fs']
[2024-11-04, 09:55:59 UTC] {standard_task_runner.py:105} INFO - Job 39075: Subtask glue_job
[2024-11-04, 09:55:59 UTC] {task_command.py:467} INFO - Running <TaskInstance: DT-LAB-Autostop-Schedule-PushProximity-1.glue_job manual__2024-11-04T09:55:51+00:00 [running]> on host ip-10-5-146-175.eu-west-1.compute.internal
[2024-11-04, 09:56:00 UTC] {taskinstance.py:3131} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='airflow' AIRFLOW_CTX_DAG_ID='DT-LAB-Autostop-Schedule-PushProximity-1' AIRFLOW_CTX_TASK_ID='glue_job' AIRFLOW_CTX_EXECUTION_DATE='2024-11-04T09:55:51+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2024-11-04T09:55:51+00:00'
[2024-11-04, 09:56:00 UTC] {taskinstance.py:731} ▲▲▲ Log group end
[2024-11-04, 09:56:00 UTC] {glue.py:188} INFO - Initializing AWS Glue Job: DT-LAB-Autostop-Schedule-PushProximity-1. Wait for completion: True
[2024-11-04, 09:56:00 UTC] {glue.py:365} INFO - Checking if job already exists: DT-LAB-Autostop-Schedule-PushProximity-1
[2024-11-04, 09:56:00 UTC] {base.py:84} INFO - Retrieving connection 'aws_shopfully_data'
[2024-11-04, 09:56:02 UTC] {glue.py:209} INFO - You can monitor this Glue Job run at: https://console.aws.amazon.com/gluestudio/home?region=eu-west-1#/job/DT-LAB-Autostop-Schedule-PushProximity-1/run/jr_d8efd3e7c7ca097b048db1e2e4342cd3e8eb3f03a225335bdef726c3ff63abb3
[2024-11-04, 09:56:02 UTC] {taskinstance.py:288} INFO - Pausing task as DEFERRED. dag_id=DT-LAB-Autostop-Schedule-PushProximity-1, task_id=glue_job, run_id=manual__2024-11-04T09:55:51+00:00, execution_date=20241104T095551, start_date=20241104T095559
[2024-11-04, 09:56:02 UTC] {taskinstance.py:340} ▼ Post task execution logs
[2024-11-04, 09:56:02 UTC] {local_task_job_runner.py:260} INFO - Task exited with return code 100 (task deferral)
[2024-11-04, 09:56:02 UTC] {local_task_job_runner.py:245} ▲▲▲ Log group end
[2024-11-04, 09:56:04 UTC] {base.py:84} INFO - Retrieving connection 'aws_shopfully_data'
[2024-11-04, 09:56:05 UTC] {base.py:84} INFO - Retrieving connection 'aws_shopfully_data'
```

- With `BatchOperator`:

```
ip-10-5-146-175.eu-west-1.compute.internal
*** Reading remote log from Cloudwatch log_group: airflow-DT-LAB-Airflow-Environment-Task log_stream: dag_id=DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1/run_id=scheduled__2024-11-04T09_00_00+00_00/task_id=batch_job/attempt=1.log.
[2024-11-04, 10:00:13 UTC] {local_task_job_runner.py:123} ▼ Pre task execution logs
[2024-11-04, 10:00:13 UTC] {taskinstance.py:2612} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1.batch_job scheduled__2024-11-04T09:00:00+00:00 [queued]>
[2024-11-04, 10:00:13 UTC] {taskinstance.py:2612} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1.batch_job scheduled__2024-11-04T09:00:00+00:00 [queued]>
[2024-11-04, 10:00:13 UTC] {taskinstance.py:2865} INFO - Starting attempt 1 of 1
[2024-11-04, 10:00:13 UTC] {taskinstance.py:2888} INFO - Executing <Task(BatchOperator): batch_job> on 2024-11-04 09:00:00+00:00
[2024-11-04, 10:00:13 UTC] {standard_task_runner.py:72} INFO - Started process 2486 to run task
[2024-11-04, 10:00:13 UTC] {standard_task_runner.py:104} INFO - Running: ['airflow', 'tasks', 'run', 'DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1', 'batch_job', 'scheduled__2024-11-04T09:00:00+00:00', '--job-id', '39077', '--raw', '--subdir', 'DAGS_FOLDER/DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1.py', '--cfg-path', '/tmp/tmp7b1ya_1l']
[2024-11-04, 10:00:13 UTC] {standard_task_runner.py:105} INFO - Job 39077: Subtask batch_job
[2024-11-04, 10:00:13 UTC] {task_command.py:467} INFO - Running <TaskInstance: DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1.batch_job scheduled__2024-11-04T09:00:00+00:00 [running]> on host ip-10-5-146-175.eu-west-1.compute.internal
[2024-11-04, 10:00:14 UTC] {taskinstance.py:3131} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='airflow' AIRFLOW_CTX_DAG_ID='DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1' AIRFLOW_CTX_TASK_ID='batch_job' AIRFLOW_CTX_EXECUTION_DATE='2024-11-04T09:00:00+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-11-04T09:00:00+00:00'
[2024-11-04, 10:00:14 UTC] {taskinstance.py:731} ▲▲▲ Log group end
[2024-11-04, 10:00:14 UTC] {batch.py:303} INFO - Running AWS Batch job - job definition: arn:aws:batch:eu-west-1:532132528437:job-definition/DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1:358 - on queue DT-LAB-BatchExecution-Standard
[2024-11-04, 10:00:14 UTC] {batch.py:310} INFO - AWS Batch job - container overrides: {'environment': [{'name': '_DATATOOLS_DAG_DEPLOYMENT_IDENTIFIER', 'value': 'DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1'}, {'name': '_DATATOOLS_DAG_DEPLOYMENT_TARGET', 'value': 'airflow'}, {'name': 'DATATOOLS_ENVIRONMENT', 'value': 'LAB'}, {'name': '_DATATOOLS_AIRFLOW_DEPLOYMENT_DAG_ID', 'value': 'DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1'}, {'name': '_DATATOOLS_AIRFLOW_DEPLOYMENT_ENVIRONMENT_NAME', 'value': 'shopfully-data-execution'}, {'name': '_DATATOOLS_AIRFLOW_DEPLOYMENT_DATA_INTERVAL_START_DATETIME', 'value': '2024-11-04 09:00:00+00:00'}, {'name': '_DATATOOLS_AIRFLOW_DEPLOYMENT_DATA_INTERVAL_END_DATETIME', 'value': '2024-11-04 10:00:00+00:00'}, {'name': '_DATATOOLS_PROCESS_EXTRA_ENVIRONMENT_BLOB', 'value': ''}]}
[2024-11-04, 10:00:14 UTC] {base.py:84} INFO - Retrieving connection 'aws_shopfully_data_execution'
[2024-11-04, 10:00:15 UTC] {batch.py:347} INFO - AWS Batch job (c6157b39-4197-4a98-9a0d-2310866a2495) started: {'ResponseMetadata': {'RequestId': 'f7570782-3748-4d95-97b6-2fcc29820dd4', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Mon, 04 Nov 2024 10:00:15 GMT', 'content-type': 'application/json', 'content-length': '198', 'connection': 'keep-alive', 'x-amzn-requestid': 'f7570782-3748-4d95-97b6-2fcc29820dd4', 'access-control-allow-origin': '*', 'x-amz-apigw-id': 'At0vfHotjoEEbJQ=', 'access-control-expose-headers': 'X-amzn-errortype,X-amzn-requestid,X-amzn-errormessage,X-amzn-trace-id,X-amz-apigw-id,date', 'x-amzn-trace-id': 'Root=1-67289b2f-4f45249a51c82f7f02fdaa9d'}, 'RetryAttempts': 0}, 'jobArn': 'arn:aws:batch:eu-west-1:532132528437:job/c6157b39-4197-4a98-9a0d-2310866a2495', 'jobName': 'DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1', 'jobId': 'c6157b39-4197-4a98-9a0d-2310866a2495'}
[2024-11-04, 10:00:15 UTC] {taskinstance.py:288} INFO - Pausing task as DEFERRED. dag_id=DT-LAB-DataTools-Schedule-RemoteExecutionCheck-1, task_id=batch_job, run_id=scheduled__2024-11-04T09:00:00+00:00, execution_date=20241104T090000, start_date=20241104T100013
[2024-11-04, 10:00:15 UTC] {taskinstance.py:340} ▼ Post task execution logs
[2024-11-04, 10:00:16 UTC] {local_task_job_runner.py:260} INFO - Task exited with return code 100 (task deferral)
[2024-11-04, 10:00:16 UTC] {local_task_job_runner.py:245} ▲▲▲ Log group end
```
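
For what it's worth, this is the kind of check I can run on our side to see whether those DEFERRED rows were ever picked up (a minimal sketch, assuming direct access to the Airflow metadata DB):

```python
# Minimal diagnostic sketch, assuming it runs where the Airflow metadata DB
# is reachable (e.g. on the scheduler host). For every task instance stuck
# in DEFERRED it prints the associated trigger row; a missing trigger_id or
# an unset triggerer_id would suggest no triggerer ever claimed the deferral.
from airflow.models import TaskInstance
from airflow.models.trigger import Trigger
from airflow.utils.session import create_session
from airflow.utils.state import TaskInstanceState

with create_session() as session:
    deferred_tis = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == TaskInstanceState.DEFERRED)
        .all()
    )
    for ti in deferred_tis:
        trigger = session.get(Trigger, ti.trigger_id) if ti.trigger_id else None
        print(
            f"{ti.dag_id}.{ti.task_id} run_id={ti.run_id} "
            f"trigger_id={ti.trigger_id} "
            f"triggerer_id={getattr(trigger, 'triggerer_id', None)}"
        )
```

`airflow jobs check --job-type TriggererJob` should also tell us whether a triggerer with a recent heartbeat is alive at all.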


GitHub link: https://github.com/apache/airflow/discussions/43835#discussioncomment-11192351
