SameerMesiah97 opened a new issue, #61142:
URL: https://github.com/apache/airflow/issues/61142
### Apache Airflow Provider(s)
amazon
### Versions of Apache Airflow Providers
`apache-airflow-providers-amazon==9.20.0`
### Apache Airflow version
main
### Operating System
Debian GNU/Linux 12 (bookworm)
### Deployment
Other
### Deployment details
_No response_
### What happened
When using `EksCreateNodegroupOperator`, a managed nodegroup may be
successfully created even when the AWS execution role has **partial EKS
permissions**, for example lacking `eks:DescribeNodegroup`.
In this scenario, the operator successfully calls `CreateNodegroup` and the
nodegroup (and backing EC2 instances) is created in AWS. However, subsequent
steps—such as waiting for the nodegroup to become active when
`wait_for_completion=True`—fail due to insufficient permissions.
The Airflow task then fails, but the EKS managed nodegroup remains active in
AWS, along with its EC2 instances, resulting in leaked infrastructure and
ongoing cost.
This can occur, for example, when the execution role allows
`eks:CreateNodegroup` but denies `eks:DescribeNodegroup`, which is required by
the waiter used to monitor nodegroup provisioning.
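For illustration, the permission split described above could look roughly like the policy document below. This is a sketch only: the `Sid` values and the wildcard `Resource` are assumptions, and a real role will likely need additional permissions (for example `iam:PassRole` for the node role) before `CreateNodegroup` succeeds.

```python
import json

# Hypothetical policy for illustration: creation is allowed, describe is denied.
# Sid names and the "*" resource are assumptions, not taken from the report.
PARTIAL_EKS_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCreateNodegroup",
            "Effect": "Allow",
            "Action": ["eks:CreateNodegroup"],
            "Resource": "*",
        },
        {
            "Sid": "DenyDescribeNodegroup",
            "Effect": "Deny",
            "Action": ["eks:DescribeNodegroup"],
            "Resource": "*",
        },
    ],
}

# Serialize to the JSON form that IAM expects for a policy document.
policy_json = json.dumps(PARTIAL_EKS_POLICY, indent=2)
print(policy_json)
```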
### What you think should happen instead
If the operator fails after successfully creating a nodegroup (for example
due to missing `DescribeNodegroup` or other follow-up permissions), it should
make a best-effort attempt to clean up the partially created resource by
deleting the nodegroup.
Cleanup should be attempted opportunistically (i.e. only if the nodegroup
name is known and the necessary permissions are available), and failure to
clean up should not mask or replace the original exception.
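A minimal sketch of the proposed behavior, written against a boto3-style EKS client rather than the operator itself. The function name `create_nodegroup_with_cleanup` is hypothetical, and this is not the provider's actual implementation; it only shows the shape of "create, wait, delete on failure, re-raise the original error".

```python
import logging

log = logging.getLogger(__name__)


def create_nodegroup_with_cleanup(client, cluster_name, nodegroup_name, **create_kwargs):
    """Create a nodegroup, wait for it, and try to delete it if waiting fails."""
    client.create_nodegroup(
        clusterName=cluster_name, nodegroupName=nodegroup_name, **create_kwargs
    )
    try:
        # The waiter polls DescribeNodegroup, which is where a partial role fails.
        client.get_waiter("nodegroup_active").wait(
            clusterName=cluster_name, nodegroupName=nodegroup_name
        )
    except Exception:
        try:
            # Best effort: the nodegroup name is known, so attempt deletion.
            client.delete_nodegroup(
                clusterName=cluster_name, nodegroupName=nodegroup_name
            )
        except Exception:
            # Cleanup failure is only logged, so it never masks the original error.
            log.exception("Best-effort cleanup of nodegroup %s failed", nodegroup_name)
        # Re-raise the original post-creation failure unchanged.
        raise
```

The bare `raise` sits outside the inner `try`, so the exception that reaches Airflow is always the one from the waiter, whether or not the delete call succeeded.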
### How to reproduce
1. Create an IAM role that **allows** `eks:CreateNodegroup` but **denies**
`eks:DescribeNodegroup`.
2. Configure an AWS connection in Airflow using this role.
(The connection ID `aws_test_conn` is used for this reproduction.)
3. Create an EKS cluster.
(The cluster name `airflow-partial-auth-eks` is used for this
reproduction.)
4. Create an IAM role for EKS managed nodegroups.
(The role `AmazonEKSNodeRole` is used for this reproduction.)
5. Use the following DAG:
```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksCreateNodegroupOperator

with DAG(
    dag_id="eks_partial_auth_nodegroup_leak_repro",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    create_nodegroup = EksCreateNodegroupOperator(
        task_id="create_nodegroup",
        aws_conn_id="aws_test_conn",
        cluster_name="airflow-partial-auth-eks",
        nodegroup_name="leaky-nodegroup",
        nodegroup_subnets=[
            "subnet-xxxxxxxxxxxxxxxxx",
            "subnet-yyyyyyyyyyyyyyyyy",
        ],
        nodegroup_role_arn="arn:aws:iam::123456789012:role/AmazonEKSNodeRole",
        wait_for_completion=True,  # triggers DescribeNodegroup via waiter
    )
```
6. Trigger the DAG.
**Expected Result**
The task fails due to missing `eks:DescribeNodegroup` permissions, but the
managed nodegroup is successfully created and remains active in AWS. The
backing EC2 instances continue running and are not cleaned up automatically.
### Anything else
This is another instance of an AWS operator leaking resources when execution
fails after partial success due to insufficient IAM permissions. Similar
failure modes have already been identified across other AWS operators where
resources are created successfully but not cleaned up if follow-up steps fail.
Apache Airflow is now introducing best-effort cleanup behavior for multiple
AWS operators to address this class of issue. In particular,
`EC2CreateInstanceOperator` now attempts cleanup on post-creation failures (PR
#60904), and corresponding changes have been proposed for
`EmrCreateJobFlowOperator` (PR #61010) and `EcsRunTaskOperator` (#61051).
Given this precedent, applying the same best-effort cleanup pattern to
`EksCreateNodegroupOperator` would improve consistency across AWS providers,
reduce leaked infrastructure, and make operator behavior more predictable in
environments with tightly scoped IAM roles.
### Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)