SameerMesiah97 opened a new pull request, #61145:
URL: https://github.com/apache/airflow/pull/61145

   **Description**
   
   Added best-effort cleanup for EKS managed nodegroups so that a nodegroup 
is deleted when a failure occurs after it has been successfully created.
   
   Previously, nodegroup creation could succeed via `create_nodegroup`, but the 
operator could then fail during post-creation steps (for example, when waiting 
for nodegroup readiness with `wait_for_completion=True` and missing 
`eks:DescribeNodegroup` permissions). In these cases, the Airflow task failed 
while the EKS managed nodegroup continued provisioning or running in AWS.
   
   Cleanup logic has now been added to the internal `_create_compute` helper. 
If an exception is raised after nodegroup creation during the wait phase, the 
operator attempts a best-effort deletion of the nodegroup. Cleanup failures are 
logged but do not mask or replace the original exception.
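   
   The pattern described above can be sketched roughly as follows. This is an 
illustration only, not the code in the PR: `create_nodegroup_with_cleanup` and 
its three callables are hypothetical stand-ins for the hook calls made inside 
`_create_compute`.

```python
import logging

log = logging.getLogger(__name__)


def create_nodegroup_with_cleanup(create, wait, delete):
    """Create a nodegroup, wait for it, and try to delete it if the wait fails.

    ``create``, ``wait`` and ``delete`` are hypothetical zero-argument callables
    used for illustration; they are not the provider's real signatures.
    """
    # A failure here propagates as-is: nothing was created, so no cleanup is needed.
    create()
    try:
        wait()
    except Exception:
        log.warning("Wait failed after nodegroup creation; attempting cleanup.")
        try:
            delete()
        except Exception:
            # Log and swallow so the wait failure is what the task reports.
            log.exception("Best-effort nodegroup cleanup failed.")
        # Re-raise the original wait failure, whether or not cleanup succeeded.
        raise
```

   Deletion is only reachable once `create()` has returned, which mirrors the 
scoping described here: cleanup is limited to a nodegroup created during the 
current execution.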
   
   **Rationale**
   
   EKS managed nodegroups are external resources whose lifecycle extends beyond 
the execution of the Airflow task. If nodegroup creation succeeds but 
subsequent steps fail, Airflow may lose the ability to observe or manage the 
resource, potentially leaving nodegroups running unexpectedly.
   
   Failures after nodegroup creation can occur for multiple reasons, including 
partial IAM permissions (for example, allowing `eks:CreateNodegroup` but 
denying `eks:DescribeNodegroup`, which is required by the waiter). In such 
cases, the nodegroup may continue provisioning even though the Airflow task has 
failed.
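   
   As an illustration of the partial-permission case, a policy fragment along 
these lines would allow creation and deletion while denying the describe call. 
The name `PARTIAL_NODEGROUP_POLICY` is hypothetical, and the fragment is not a 
complete policy: a real role would likely need further permissions not shown 
here.

```python
import json

# Hypothetical IAM policy fragment for reproducing the scenario:
# creating and deleting nodegroups is allowed, describing them is denied.
PARTIAL_NODEGROUP_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["eks:CreateNodegroup", "eks:DeleteNodegroup"],
            "Resource": "*",
        },
        {
            "Effect": "Deny",
            "Action": "eks:DescribeNodegroup",
            "Resource": "*",
        },
    ],
}

# Serialize to the JSON document form that IAM policies use.
policy_json = json.dumps(PARTIAL_NODEGROUP_POLICY, indent=2)
```

   `eks:DeleteNodegroup` is included in the allow list because the cleanup 
attempt can only succeed if the role is permitted to delete the nodegroup.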
   
   This change applies **only to nodegroup creation** and does not affect 
cluster creation, deletion, or Fargate profiles. Cleanup is scoped narrowly to 
nodegroups created during the current execution and is only attempted when 
nodegroup creation has already completed successfully. This prevents 
interference with unrelated resources while avoiding orphaned EKS-managed 
infrastructure on post-create failures.
   
   **Notes**
   
   This series of changes intentionally avoids introducing a shared abstraction 
for AWS operator cleanup logic. **Resource creation, ownership tracking, and 
cleanup semantics vary significantly across AWS services**, and a generic 
solution would add complexity without clear benefit. Cleanup is therefore 
implemented locally where behavior and failure modes are well understood.
   
   **Tests**
   
   * Added a unit test verifying that nodegroup deletion is attempted when a 
failure occurs during the wait phase after successful creation.
   * Added a unit test ensuring that failures during cleanup do not mask or 
override the original exception.
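   
   The two cases can be sketched with `unittest.mock` as below. This is not the 
test code from the PR: `create_and_wait` is a hypothetical stand-in for the 
operator's create-then-wait path, the hook is a plain `Mock`, and the method 
name `wait_until_active` is assumed for illustration.

```python
import logging
from unittest import mock

log = logging.getLogger(__name__)


def create_and_wait(hook, cluster_name, nodegroup_name):
    """Hypothetical stand-in for the create-then-wait path under test."""
    hook.create_nodegroup(clusterName=cluster_name, nodegroupName=nodegroup_name)
    try:
        # Assumed method name; the real operator waits through its own mechanism.
        hook.wait_until_active(cluster_name, nodegroup_name)
    except Exception:
        try:
            hook.delete_nodegroup(clusterName=cluster_name, nodegroupName=nodegroup_name)
        except Exception:
            log.exception("Best-effort nodegroup cleanup failed.")
        raise


def run_and_capture(hook):
    """Run the helper and return the exception it raised, or None."""
    try:
        create_and_wait(hook, "cluster", "nodegroup")
    except Exception as exc:
        return exc
    return None


def test_cleanup_attempted_when_wait_fails():
    hook = mock.Mock()
    hook.wait_until_active.side_effect = PermissionError("describe denied")

    raised = run_and_capture(hook)

    assert isinstance(raised, PermissionError)
    hook.delete_nodegroup.assert_called_once_with(
        clusterName="cluster", nodegroupName="nodegroup"
    )


def test_cleanup_failure_does_not_mask_original_error():
    hook = mock.Mock()
    hook.wait_until_active.side_effect = PermissionError("describe denied")
    hook.delete_nodegroup.side_effect = RuntimeError("delete failed")

    raised = run_and_capture(hook)

    # The wait failure surfaces, not the cleanup failure.
    assert isinstance(raised, PermissionError)
    hook.delete_nodegroup.assert_called_once()
```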
   
   **Backwards Compatibility**
   
   No changes to the public API or operator parameters.
   
   Closes: #61142
   

