prateeklohchubh opened a new issue, #41487:
URL: https://github.com/apache/airflow/issues/41487

   ### Apache Airflow Provider(s)
   
   amazon
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==8.24.0
   Airflow==2.9.2
   
   ### Apache Airflow version
   
   2.9.2
   
   ### Operating System
   
   Debian GNU/Linux 12 (bookworm)
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Deployment details
   
   Airflow is deployed on AWS ECS as Docker containers, with tasks run as ECS 
tasks using `EcsRunTaskOperator`.
   
   ### What happened
   
   Long-running tasks (longer than about 1 hour) started via `EcsRunTaskOperator` 
execute correctly, but after roughly 60 minutes the Airflow scheduler treats 
them as zombies and retries them, creating duplicate tasks.
   
   If we set `wait_for_completion` to `False` and use a sensor such as 
`EcsTaskStateSensor` to monitor the state of the ECS task, the sensor task is 
also marked as a zombie and retried.
   
   Relevant log lines:
   ```
   [2024-08-14, 11:04:39 PDT] {ecs.py:176} INFO - Task state: RUNNING, waiting 
for: STOPPED
   [2024-08-14, 11:05:17 PDT] {scheduler_job_runner.py:1737} ERROR - Detected 
zombie job: {'full_filepath': '/opt/airflow/dags/test_ecs_task_dag.py', 
'processor_subdir': '/opt/airflow/dags', 'msg': "{'DAG Id': 
'test_ecs_task_dag', 'Task Id': 'await_ecs_task_run', 'Run Id': 
'manual__2024-08-14T10:49:23-07:00', 'Hostname': 
'somehost.us-east-2.compute.internal', 'External Executor Id': 
'20d42f77-bd77-46ec-bf16-cd4e5faae379'}", 'simple_task_instance': 
<airflow.models.taskinstance.SimpleTaskInstance object at 0x7f3c22cd3dd0>, 
'is_failure_callback': True} (See 
https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html#zombie-undead-tasks)
   [2024-08-14, 11:05:39 PDT] {ecs.py:176} INFO - Task state: RUNNING, waiting 
for: STOPPED
   ```
   Relevant Airflow config for zombie detection:
   `scheduler_zombie_task_threshold = 300`
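
To make the timing concrete, here is a minimal sketch of how the log timestamps compare with that threshold. It is a simplified model, not the scheduler's actual code: it assumes a running task is flagged once its last recorded heartbeat is older than the threshold (the real check may also consider the state of the task's job). `is_zombie` and `stale_heartbeat` are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta

# Value from the config quoted above (scheduler_zombie_task_threshold = 300).
ZOMBIE_THRESHOLD = timedelta(seconds=300)


def is_zombie(latest_heartbeat: datetime, now: datetime,
              threshold: timedelta = ZOMBIE_THRESHOLD) -> bool:
    """Simplified model (assumption): flag a running task once its last
    recorded heartbeat is older than the threshold."""
    return now - latest_heartbeat > threshold


# Timestamps copied from the log excerpt (PDT; timezone dropped for brevity).
last_task_log = datetime(2024, 8, 14, 11, 4, 39)
zombie_detected = datetime(2024, 8, 14, 11, 5, 17)

gap = zombie_detected - last_task_log
print(f"task last logged {gap.total_seconds():.0f}s before zombie detection")

# Had a heartbeat been recorded when the task last logged, this check would not fire.
print(is_zombie(last_task_log, zombie_detected))

# Hypothetical stale heartbeat: only a gap above 300 s trips the check.
stale_heartbeat = zombie_detected - timedelta(seconds=301)
print(is_zombie(stale_heartbeat, zombie_detected))
```

Under this model the task process was still logging well inside the threshold when it was flagged, which suggests the heartbeat itself was not being recorded rather than the task being dead.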
   
   
   The scheduler container's resources are not the issue, and neither are those 
of the task run using `EcsRunTaskOperator`. Both have healthy CPU and memory 
headroom.
   
   ### What you think should happen instead
   
   The operator should either publish heartbeats at a faster cadence or allow 
the heartbeat interval to be configured.
   
   ### How to reproduce
   
   Any task that runs for more than 60 minutes ends up marked as a zombie, 
sometimes even earlier than 60 minutes.
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

