SameerMesiah97 opened a new issue, #61142:
URL: https://github.com/apache/airflow/issues/61142

   ### Apache Airflow Provider(s)
   
   amazon
   
   ### Versions of Apache Airflow Providers
   
   `apache-airflow-providers-amazon==9.20.0`
   
   ### Apache Airflow version
   
   main
   
   ### Operating System
   
   Debian GNU/Linux 12 (bookworm)
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   When using `EksCreateNodegroupOperator`, a managed nodegroup may be 
successfully created even when the AWS execution role has **partial EKS 
permissions**, for example lacking `eks:DescribeNodegroup`.
   
   In this scenario, the operator successfully calls `CreateNodegroup` and the 
nodegroup and its backing EC2 instances are created in AWS. However, subsequent 
steps, such as waiting for the nodegroup to become active when 
`wait_for_completion=True`, fail due to insufficient permissions.
   
   The Airflow task then fails, but the EKS managed nodegroup remains active in 
AWS, along with its EC2 instances, resulting in leaked infrastructure and 
ongoing cost.
   
   This can occur, for example, when the execution role allows 
`eks:CreateNodegroup` but denies `eks:DescribeNodegroup`, which is required by 
the waiter used to monitor nodegroup provisioning.
   
   
   ### What you think should happen instead
   
   If the operator fails after successfully creating a nodegroup (for example 
due to missing `DescribeNodegroup` or other follow-up permissions), it should 
make a best-effort attempt to clean up the partially created resource by 
deleting the nodegroup.
   
   Cleanup should be attempted opportunistically (i.e. only if the nodegroup 
name is known and the necessary permissions are available), and failure to 
clean up should not mask or replace the original exception.
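
A minimal sketch of the behavior described above, not the provider's actual implementation. `FakeEksClient` and `create_nodegroup_with_cleanup` are hypothetical names introduced for illustration; the stub only mimics the `create_nodegroup`, `describe_nodegroup`, and `delete_nodegroup` method names of the boto3 EKS client so that the example runs without AWS access.

```python
import logging

log = logging.getLogger(__name__)


class FakeEksClient:
    """Hypothetical stand-in for an EKS client whose role lacks eks:DescribeNodegroup."""

    def __init__(self):
        self.nodegroups = set()

    def create_nodegroup(self, clusterName, nodegroupName):
        self.nodegroups.add((clusterName, nodegroupName))

    def describe_nodegroup(self, clusterName, nodegroupName):
        # Simulates the AccessDenied error hit by the waiter.
        raise PermissionError("AccessDenied: eks:DescribeNodegroup")

    def delete_nodegroup(self, clusterName, nodegroupName):
        self.nodegroups.discard((clusterName, nodegroupName))


def create_nodegroup_with_cleanup(client, cluster_name, nodegroup_name):
    """Create a nodegroup; on a post-creation failure, try to delete it and re-raise."""
    client.create_nodegroup(clusterName=cluster_name, nodegroupName=nodegroup_name)
    try:
        # Stand-in for the wait step that polls DescribeNodegroup.
        client.describe_nodegroup(clusterName=cluster_name, nodegroupName=nodegroup_name)
    except Exception:
        log.warning("Post-creation step failed; attempting best-effort cleanup.")
        try:
            client.delete_nodegroup(clusterName=cluster_name, nodegroupName=nodegroup_name)
        except Exception:
            # A cleanup failure is logged only, so it cannot replace the original error.
            log.exception("Best-effort cleanup of nodegroup %s failed.", nodegroup_name)
        raise  # re-raises the original post-creation exception


if __name__ == "__main__":
    client = FakeEksClient()
    try:
        create_nodegroup_with_cleanup(client, "airflow-partial-auth-eks", "leaky-nodegroup")
    except PermissionError as err:
        print(f"task failed with: {err}")
    print(f"nodegroups left behind: {client.nodegroups}")
```

The bare `raise` sits outside the inner `try`/`except`, so the exception that reaches the caller is always the one from the failed follow-up step, whether or not the delete call succeeds.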
   
   
   ### How to reproduce
   
   1. Create an IAM role that **allows** `eks:CreateNodegroup` but **denies** 
`eks:DescribeNodegroup`
   
   2. Configure an AWS connection in Airflow using this role.
      (The connection ID `aws_test_conn` is used for this reproduction.)
   
   3. Create an EKS cluster.
      (The cluster name `airflow-partial-auth-eks` is used for this 
reproduction.)
   
   4. Create an IAM role for EKS managed nodegroups.
      (The role `AmazonEKSNodeRole` is used for this reproduction.)
   
   5. Use the following DAG:
   
   ```python
   from datetime import datetime
   
   from airflow import DAG
   from airflow.providers.amazon.aws.operators.eks import EksCreateNodegroupOperator
   
   
   with DAG(
       dag_id="eks_partial_auth_nodegroup_leak_repro",
       start_date=datetime(2025, 1, 1),
       schedule=None,
       catchup=False,
   ) as dag:
       create_nodegroup = EksCreateNodegroupOperator(
           task_id="create_nodegroup",
           aws_conn_id="aws_test_conn",
           cluster_name="airflow-partial-auth-eks",
           nodegroup_name="leaky-nodegroup",
           nodegroup_subnets=[
               "subnet-xxxxxxxxxxxxxxxxx",
               "subnet-yyyyyyyyyyyyyyyyy",
           ],
           nodegroup_role_arn="arn:aws:iam::123456789012:role/AmazonEKSNodeRole",
           wait_for_completion=True,  # triggers DescribeNodegroup via waiter
       )
   ```
   6. Trigger the DAG.
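
For step 1, the role's policy could look roughly like the document built below. This is an illustrative sketch only: the `Sid` values and the `"Resource": "*"` scope are assumptions made for brevity, and a real role would likely need further permissions (for example, to pass the node role) before `CreateNodegroup` succeeds.

```python
import json

# Hypothetical policy for the reproduction role: nodegroup creation is allowed,
# while the DescribeNodegroup call used by the waiter is explicitly denied.
PARTIAL_EKS_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCreateNodegroup",  # assumed Sid, any unique label works
            "Effect": "Allow",
            "Action": "eks:CreateNodegroup",
            "Resource": "*",
        },
        {
            "Sid": "DenyDescribeNodegroup",  # assumed Sid
            "Effect": "Deny",
            "Action": "eks:DescribeNodegroup",
            "Resource": "*",
        },
    ],
}

# Print the JSON document that would be attached to the role.
print(json.dumps(PARTIAL_EKS_POLICY, indent=2))
```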
   
   **Observed Result**
   
   The task fails due to missing `eks:DescribeNodegroup` permissions, but the 
managed nodegroup is successfully created and remains active in AWS. The 
backing EC2 instances continue running and are not cleaned up automatically.
   
   ### Anything else
   
   This is another instance of an AWS operator leaking resources when execution 
fails after partial success due to insufficient IAM permissions. Similar 
failure modes have already been identified across other AWS operators where 
resources are created successfully but not cleaned up if follow-up steps fail.
   
   Apache Airflow is now introducing best-effort cleanup behavior for multiple 
AWS operators to address this class of issue. In particular, 
`EC2CreateInstanceOperator` now attempts cleanup on post-creation failures (PR 
#60904), and corresponding changes have been proposed for 
`EMRCreateJobFlowOperator` (PR #61010) and `EcsRunTaskOperator` (#61051).
   
   Given this precedent, applying the same best-effort cleanup pattern to 
`EksCreateNodegroupOperator` would improve consistency across AWS operators, 
reduce leaked infrastructure, and make operator behavior more predictable in 
environments with tightly scoped IAM roles.
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.