SameerMesiah97 opened a new pull request, #61051:
URL: https://github.com/apache/airflow/pull/61051

   **Description**
   
   Added best-effort cleanup to `EcsRunTaskOperator` to ensure ECS tasks are 
stopped when failures occur after a task has been successfully started.
   
   Previously, the operator could successfully start an ECS task via `RunTask` 
and then fail during post-start steps (for example, when waiting for task 
completion with `wait_for_completion=True` and missing `ecs:DescribeTasks` 
permissions). In these cases, the Airflow task failed while the ECS task 
continued running in AWS.
   
   The operator now attempts to stop any ECS task that was started by the 
current task instance if an exception is raised after task start. Cleanup is 
performed opportunistically and does not mask or replace the original exception 
if stopping the task fails.
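
A minimal sketch of the cleanup pattern described above, not the operator's actual code. `FakeEcsClient` and `run_with_cleanup` are hypothetical names introduced for illustration; the fake client only mirrors the `run_task`, `describe_tasks`, and `stop_task` method names of the boto3 ECS client so the example runs without AWS access.

```python
import logging

log = logging.getLogger(__name__)


class FakeEcsClient:
    """Hypothetical stand-in for a boto3 ECS client; it only records calls."""

    def __init__(self, fail_describe=False, fail_stop=False):
        self.fail_describe = fail_describe
        self.fail_stop = fail_stop
        self.stopped = []

    def run_task(self, cluster):
        return {"tasks": [{"taskArn": "arn:aws:ecs:example:task/abc"}]}

    def describe_tasks(self, cluster, tasks):
        if self.fail_describe:
            raise PermissionError("missing ecs:DescribeTasks")
        return {"tasks": [{"lastStatus": "STOPPED"}]}

    def stop_task(self, cluster, task, reason):
        if self.fail_stop:
            raise RuntimeError("stop failed")
        self.stopped.append(task)


def run_with_cleanup(client, cluster):
    """Start a task, then stop it on a best-effort basis if a later step fails."""
    arn = client.run_task(cluster=cluster)["tasks"][0]["taskArn"]
    try:
        # Post-start step that may fail, e.g. polling for completion.
        client.describe_tasks(cluster=cluster, tasks=[arn])
    except Exception:
        try:
            client.stop_task(cluster=cluster, task=arn, reason="Post-start failure")
        except Exception:
            # A cleanup failure is logged and swallowed.
            log.exception("Best-effort stop of %s failed", arn)
        # Bare raise re-raises the original post-start exception.
        raise
    return arn
```

The bare `raise` sits outside the inner `try`, so the caller sees the post-start exception even when `stop_task` itself fails.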
   
   **Rationale**
   
   `EcsRunTaskOperator` manages the lifecycle of an external resource that can 
outlive the Airflow task. If task start succeeds but a subsequent execution 
step fails, Airflow can no longer reliably observe or manage the running ECS 
task, which may keep running and consuming resources unnoticed.
   
   Failures after task start can occur for multiple reasons, including IAM 
permission errors (for example, missing `ecs:DescribeTasks`) or loss of access 
to systems used during task execution. Attempting best-effort cleanup in these 
scenarios avoids leaving unmanaged ECS tasks running while preserving existing 
failure semantics.
   
   Cleanup is only attempted when the operator can confidently determine that 
the ECS task was started by the current execution. This is achieved by tracking 
whether the task was started during the current run and using the task ARN 
returned by `RunTask`. This avoids interfering with pre-existing tasks in 
reattach scenarios while still preventing resource leaks on post-start failures.
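
The ownership check can be sketched roughly as follows. `TaskHandle`, `started_here`, and `should_stop` are hypothetical names, not the attributes used in the operator; the sketch only illustrates recording the ARN together with a flag that is set when the current run itself called `RunTask`.

```python
class TaskHandle:
    """Hypothetical record of an ECS task and of who started it."""

    def __init__(self):
        self.arn = None
        self.started_here = False

    def reattach(self, existing_arn):
        # Task was found already running; this run did not start it.
        self.arn = existing_arn
        self.started_here = False

    def mark_started(self, run_task_response):
        # Task ARN taken from the RunTask response of the current run.
        self.arn = run_task_response["tasks"][0]["taskArn"]
        self.started_here = True


def should_stop(handle):
    """Cleanup is allowed only for tasks started by the current run."""
    return handle.started_here and handle.arn is not None


reattached = TaskHandle()
reattached.reattach("arn:aws:ecs:example:task/old")

started = TaskHandle()
started.mark_started({"tasks": [{"taskArn": "arn:aws:ecs:example:task/new"}]})
```

Under these assumptions, `should_stop(reattached)` is false and `should_stop(started)` is true, so a reattached task is never stopped by the cleanup path.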
   
   **Tests**
   
   * Added a unit test verifying that an ECS task is stopped when a failure 
occurs after task start.
   * Added a unit test ensuring that failures during cleanup do not mask or 
override the original exception.
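
The two test cases could look roughly like the sketch below. It does not reproduce the tests added in this PR: `execute` is a hypothetical, simplified stand-in for the operator logic, and the client is a plain `unittest.mock.MagicMock` rather than a real hook.

```python
from unittest.mock import MagicMock

ARN = "arn:aws:ecs:example:task/abc"


def execute(client):
    """Hypothetical stand-in for the operator's start-then-wait logic."""
    arn = client.run_task()["tasks"][0]["taskArn"]
    try:
        client.describe_tasks(tasks=[arn])
    except Exception:
        try:
            client.stop_task(task=arn)
        except Exception:
            pass  # best-effort: ignore cleanup failures
        raise


def make_client():
    client = MagicMock()
    client.run_task.return_value = {"tasks": [{"taskArn": ARN}]}
    client.describe_tasks.side_effect = PermissionError("missing ecs:DescribeTasks")
    return client


def test_task_stopped_on_post_start_failure():
    client = make_client()
    try:
        execute(client)
    except PermissionError:
        pass
    else:
        raise AssertionError("expected PermissionError")
    client.stop_task.assert_called_once_with(task=ARN)


def test_cleanup_failure_does_not_mask_original():
    client = make_client()
    client.stop_task.side_effect = RuntimeError("stop failed")
    try:
        execute(client)
    except PermissionError as exc:
        # The original error surfaces, not the RuntimeError from cleanup.
        assert "DescribeTasks" in str(exc)
    else:
        raise AssertionError("expected PermissionError")
```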
   
   **Backwards Compatibility**
   
   No changes to the public API or operator parameters.
   
   Closes: #61050

