SameerMesiah97 opened a new issue, #61324:
URL: https://github.com/apache/airflow/issues/61324

   ### Apache Airflow Provider(s)
   
   amazon
   
   ### Versions of Apache Airflow Providers
   
   `apache-airflow-providers-amazon>=9.21.0rc1`
   
   ### Apache Airflow version
   
   main
   
   ### Operating System
   
   Debian GNU/Linux 12 (bookworm)
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   When using `RedshiftCreateClusterOperator`, a Redshift cluster may be 
successfully created even when the AWS execution role has **partial Redshift 
permissions**, for example lacking `redshift:DescribeClusters`.
   
   In this scenario, the operator successfully calls `create_cluster` and the 
Redshift cluster begins provisioning in AWS. However, subsequent steps—such as 
waiting for the cluster to become available when 
`wait_for_completion=True`—fail due to insufficient permissions.
   
   The Airflow task then fails, but the Redshift cluster continues provisioning 
or remains active in AWS, resulting in leaked infrastructure and ongoing cost.
   
   This can occur, for example, when the execution role allows 
`redshift:CreateCluster` but explicitly denies `redshift:DescribeClusters`, 
which is required by the waiter used to monitor cluster availability.
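
For illustration, a policy of roughly this shape would produce the partial-permission state described above. This is a minimal sketch, not the exact policy used in the reproduction: the `Sid` values and the wildcard `Resource` are assumptions, and a real role may need further permissions before `CreateCluster` succeeds.

```python
import json

# Hypothetical IAM policy: statement IDs and the "*" resource are illustrative.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCreateCluster",
            "Effect": "Allow",
            "Action": "redshift:CreateCluster",
            "Resource": "*",
        },
        {
            # An explicit Deny overrides any Allow granted elsewhere.
            "Sid": "DenyDescribeClusters",
            "Effect": "Deny",
            "Action": "redshift:DescribeClusters",
            "Resource": "*",
        },
    ],
}

# Serialize to the JSON document that IAM expects.
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```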
   
   ### What you think should happen instead
   
   If the operator fails after successfully initiating cluster creation (for 
example due to missing `DescribeClusters` or other follow-up permissions), it 
should make a **best-effort attempt to clean up** the partially created 
resource by deleting the cluster.
   
   Cleanup should be attempted opportunistically (i.e. only if the cluster 
identifier is known and the necessary permissions are available), and failure 
to clean up should **not mask or replace the original exception**.
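
The behaviour requested above could be sketched roughly as follows. This is not the provider's implementation: `create_cluster_with_cleanup` and the `wait` callable are hypothetical names, and `client` is assumed to expose boto3-style `create_cluster` and `delete_cluster` methods. The point is only the control flow: attempt deletion, log a cleanup failure, and re-raise the original exception.

```python
import logging

log = logging.getLogger(__name__)


def create_cluster_with_cleanup(client, cluster_identifier, wait, **create_kwargs):
    """Create a cluster; if a later step fails, try to delete it and re-raise.

    `client` is assumed to be a boto3-style Redshift client and `wait` any
    callable that blocks until the cluster is available (both hypothetical here).
    """
    # If creation itself fails there is nothing to clean up, so it is not guarded.
    client.create_cluster(ClusterIdentifier=cluster_identifier, **create_kwargs)
    try:
        wait(cluster_identifier)
    except Exception:
        try:
            client.delete_cluster(
                ClusterIdentifier=cluster_identifier,
                SkipFinalClusterSnapshot=True,
            )
        except Exception:
            # Cleanup is best-effort: log the failure, never replace the original error.
            log.exception("Best-effort cleanup of cluster %s failed", cluster_identifier)
        # Bare raise re-raises the exception from wait(), not the cleanup error.
        raise
```

With this shape, a role that lacks `redshift:DeleteCluster` as well still surfaces the original `DescribeClusters` error, with the cleanup failure visible only in the task log.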
   
   ### How to reproduce
   
   1. Create an IAM role that allows `redshift:CreateCluster` but denies 
`redshift:DescribeClusters`.
   
   2. Configure an AWS connection in Airflow using this role.
      (The connection ID `aws_test_conn` is used for this reproduction.)
   
   3. Ensure a valid Redshift cluster subnet group exists.
      (For example: `example-subnet-group`.)
   
   4. Use the following DAG:
   
   ```python
   from datetime import datetime
   
   from airflow import DAG
   from airflow.providers.amazon.aws.operators.redshift_cluster import (
       RedshiftCreateClusterOperator,
   )
   
   with DAG(
       dag_id="redshift_partial_auth_cluster_leak_repro",
       start_date=datetime(2025, 1, 1),
       schedule=None,
       catchup=False,
   ) as dag:
       create_cluster = RedshiftCreateClusterOperator(
           task_id="create_redshift_cluster",
           aws_conn_id="aws_test_conn",
           cluster_identifier="leaky-redshift-cluster",
           node_type="ra3.large",
           master_username="example",
           master_user_password="example",  # placeholder; Redshift requires 8-64 chars incl. upper, lower, digit
           cluster_type="single-node",
           cluster_subnet_group_name="example-subnet-group",
           wait_for_completion=True,  # triggers DescribeClusters via waiter
       )
   ```
   
   5. Trigger the DAG.
   
   **Observed Behaviour**
   
   The task fails due to missing `redshift:DescribeClusters` permissions, but 
the Redshift cluster is successfully created and remains active in AWS. The 
cluster is not cleaned up automatically and continues incurring cost.
   
   ### Anything else
   
   Redshift clusters begin incurring cost immediately once creation starts, 
even if the cluster never reaches an `available` state. When post-creation 
failures occur, leaked clusters can therefore result in unexpected and ongoing 
cost.
   
   This issue follows a broader pattern across AWS operators where resources 
are created successfully but not cleaned up when subsequent steps fail. Apache 
Airflow has been introducing best-effort cleanup behavior to address this class 
of problems consistently across providers.
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


