prateeklohchubh opened a new issue, #41487:
URL: https://github.com/apache/airflow/issues/41487
### Apache Airflow Provider(s)
amazon
### Versions of Apache Airflow Providers
apache-airflow-providers-amazon==8.24.0
Airflow==2.9.2
### Apache Airflow version
2.9.2
### Operating System
Debian GNU/Linux 12 (bookworm)
### Deployment
Other Docker-based deployment
### Deployment details
Airflow is deployed on AWS ECS as Docker containers, with tasks run as ECS
tasks using `EcsRunTaskOperator`.
### What happened
Long-running tasks (longer than about 1 hour) run via `EcsRunTaskOperator`
execute correctly, but after roughly 60 minutes the Airflow scheduler treats
them as zombies and retries them, creating duplicate tasks.
If we set `wait_for_completion` to `False` and use a sensor such as
`EcsTaskStateSensor` to monitor the state of the ECS task, the sensor task is
also marked as a zombie and retried.
Relevant log lines:
```
[2024-08-14, 11:04:39 PDT] {ecs.py:176} INFO - Task state: RUNNING, waiting for: STOPPED
[2024-08-14, 11:05:17 PDT] {scheduler_job_runner.py:1737} ERROR - Detected zombie job: {'full_filepath': '/opt/airflow/dags/test_ecs_task_dag.py', 'processor_subdir': '/opt/airflow/dags', 'msg': "{'DAG Id': 'test_ecs_task_dag', 'Task Id': 'await_ecs_task_run', 'Run Id': 'manual__2024-08-14T10:49:23-07:00', 'Hostname': 'somehost.us-east-2.compute.internal', 'External Executor Id': '20d42f77-bd77-46ec-bf16-cd4e5faae379'}", 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f3c22cd3dd0>, 'is_failure_callback': True} (See https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html#zombie-undead-tasks)
[2024-08-14, 11:05:39 PDT] {ecs.py:176} INFO - Task state: RUNNING, waiting for: STOPPED
```
Relevant Airflow config for zombie detection:
`scheduler_zombie_task_threshold = 300`
Resources are not the issue, either for the scheduler container or for the
task run using `EcsRunTaskOperator`. Both have ample CPU and memory available
to them.
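
For context, here is a rough sketch of how I understand the threshold to be applied. It is a simplified model and not the scheduler's actual code: it assumes a running task is flagged once its last recorded heartbeat is older than `scheduler_zombie_task_threshold` seconds. The function name `is_zombie` and the heartbeat timestamps are hypothetical, chosen only to line up with the log lines above.

```python
from datetime import datetime, timedelta, timezone

# Value from the config reported in this issue (seconds).
SCHEDULER_ZOMBIE_TASK_THRESHOLD = 300


def is_zombie(
    latest_heartbeat: datetime,
    now: datetime,
    threshold_secs: int = SCHEDULER_ZOMBIE_TASK_THRESHOLD,
) -> bool:
    """Simplified model (assumption): a running task is flagged as a zombie
    when its last recorded heartbeat is older than the threshold."""
    return latest_heartbeat < now - timedelta(seconds=threshold_secs)


# 11:05:17 PDT, the moment the scheduler logged "Detected zombie job".
now = datetime(2024, 8, 14, 18, 5, 17, tzinfo=timezone.utc)

# Hypothetical heartbeat times for illustration only.
fresh = now - timedelta(seconds=38)   # same instant as the 11:04:39 task log line
stale = now - timedelta(seconds=301)  # just past the 300 s threshold

print(is_zombie(fresh, now))
print(is_zombie(stale, now))
```

Under this model, a heartbeat recorded when the 11:04:39 log line was written would not have tripped the check at 11:05:17, which suggests the heartbeat being recorded lagged well behind the task's own logging.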
### What you think should happen instead
The operator should either publish heartbeats at a faster cadence or allow
the heartbeat interval to be configured.
### How to reproduce
Run any task that takes more than 60 minutes; it ends up being marked as a
zombie. Sometimes this happens even before the 60-minute mark.
### Anything else
_No response_
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)