SameerMesiah97 opened a new pull request, #61145: URL: https://github.com/apache/airflow/pull/61145
**Description**

Added best-effort cleanup for EKS managed nodegroups so that a nodegroup is deleted when a failure occurs after it has been successfully created.

Previously, nodegroup creation could succeed via `create_nodegroup`, but the operator could then fail during post-creation steps (for example, when waiting for nodegroup readiness with `wait_for_completion=True` and missing `eks:DescribeNodegroup` permissions). In these cases, the Airflow task failed while the EKS managed nodegroup continued provisioning or running in AWS.

Cleanup logic has now been added to the internal `_create_compute` helper. If an exception is raised during the wait phase, after the nodegroup has been created, the operator attempts a best-effort deletion of the nodegroup. Cleanup failures are logged but do not mask or replace the original exception.

**Rationale**

EKS managed nodegroups are external resources whose lifecycle extends beyond the execution of the Airflow task. If nodegroup creation succeeds but subsequent steps fail, Airflow may lose the ability to observe or manage the resource, potentially leaving nodegroups running unexpectedly.

Failures after nodegroup creation can occur for multiple reasons, including partial IAM permissions (for example, allowing `eks:CreateNodegroup` but denying `eks:DescribeNodegroup`, which is required by the waiter). In such cases, the nodegroup may continue provisioning even though the Airflow task has failed.

This change applies **only to nodegroup creation** and does not affect cluster creation, deletion, or Fargate profiles. Cleanup is scoped narrowly to nodegroups created during the current execution and is attempted only when nodegroup creation has already completed successfully. This prevents interference with unrelated resources while avoiding orphaned EKS-managed infrastructure on post-create failures.

**Notes**

This series of changes intentionally avoids introducing a shared abstraction for AWS operator cleanup logic. **Resource creation, ownership tracking, and cleanup semantics vary significantly across AWS services**, and a generic solution would add complexity without clear benefit. Cleanup is therefore implemented locally, where behavior and failure modes are well understood.

**Tests**

* Added a unit test verifying that nodegroup deletion is attempted when a failure occurs during the wait phase after successful creation.
* Added a unit test ensuring that failures during cleanup do not mask or override the original exception.

**Backwards Compatibility**

No changes to the public API or operator parameters.

Closes: #61142
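**Illustrative sketch**

The cleanup behavior described in this PR can be sketched roughly as follows. This is a minimal illustration of the pattern, not the code in the PR: the function `create_nodegroup_with_cleanup`, the stub `FakeEksClient`, and its `wait_until_active` method are hypothetical names introduced here, and the real `_create_compute` helper may be structured differently.

```python
import logging

log = logging.getLogger(__name__)


class FakeEksClient:
    """Hypothetical stand-in for an EKS client; records the calls it receives."""

    def __init__(self, fail_wait=True, fail_delete=False):
        self.fail_wait = fail_wait
        self.fail_delete = fail_delete
        self.calls = []

    def create_nodegroup(self, cluster_name, nodegroup_name):
        self.calls.append("create")

    def wait_until_active(self, cluster_name, nodegroup_name):
        self.calls.append("wait")
        if self.fail_wait:
            # Simulates e.g. a missing eks:DescribeNodegroup permission.
            raise PermissionError("not authorized to describe nodegroup")

    def delete_nodegroup(self, cluster_name, nodegroup_name):
        self.calls.append("delete")
        if self.fail_delete:
            raise RuntimeError("delete failed")


def create_nodegroup_with_cleanup(client, cluster_name, nodegroup_name,
                                  wait_for_completion=True):
    """Create a nodegroup; if the wait phase fails, try to delete it."""
    # Creation sits outside the try block: if it fails, nothing was
    # created by this call, so no cleanup is attempted.
    client.create_nodegroup(cluster_name, nodegroup_name)
    if not wait_for_completion:
        return
    try:
        client.wait_until_active(cluster_name, nodegroup_name)
    except Exception:
        log.warning("Wait failed after creation; attempting cleanup of %s",
                    nodegroup_name)
        try:
            client.delete_nodegroup(cluster_name, nodegroup_name)
        except Exception:
            # Cleanup is best-effort: log the failure and carry on.
            log.exception("Best-effort cleanup of %s failed", nodegroup_name)
        # Re-raise the original wait failure, not the cleanup failure.
        raise
```

With `FakeEksClient(fail_delete=True)`, the caller still sees the `PermissionError` from the wait phase, and the failed delete only appears in the logs. Keeping the create call outside the `try` block is what limits cleanup to nodegroups created during the current execution.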
