[PR] Add bounded retry cleanup to RedshiftCreateClusterOperator on post-start failure [airflow]

via GitHub Sat, 07 Mar 2026 13:09:02 -0800


SameerMesiah97 opened a new pull request, #63074:
URL: https://github.com/apache/airflow/pull/63074


   **Description**
   
   This change adds bounded best-effort cleanup retries to Redshift cluster 
creation to prevent clusters from being orphaned when failures occur after 
creation has been initiated. The cleanup behavior is implemented via a helper 
method called `_attempt_cleanup_with_retry`.
   
   Previously, `create_cluster` could successfully start provisioning, but the 
operator could fail during post-creation steps (for example if the execution 
identity lacked `redshift:DescribeClusters`). The existing implementation 
attempted cleanup but issued only a single deletion request.
   
   Because Redshift does not allow deletion while a cluster is still being 
created, this cleanup path often failed immediately. With this change, when a 
failure occurs after cluster creation has been initiated, the operator retries 
deletion when the API returns `InvalidClusterState` or 
`InvalidClusterStateFault`, indicating that another operation is still in 
progress. Cleanup retries occur within a bounded window controlled by 
`cleanup_timeout_seconds` (default: 300 seconds). Cleanup failures are logged 
and do not mask the original exception.
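
   A minimal sketch of the retry behavior described above, not the PR's actual code. The standalone function `attempt_cleanup_with_retry`, the `interval_seconds` parameter, the injectable `sleep`/`clock` arguments, and the `FakeClientError` stand-in for botocore's `ClientError` are all assumptions made so the example runs without AWS credentials.

```python
import logging
import time

log = logging.getLogger(__name__)

# Error codes the PR description names as "another operation is in progress".
RETRYABLE_CODES = {"InvalidClusterState", "InvalidClusterStateFault"}


class FakeClientError(Exception):
    """Hypothetical stand-in for botocore's ClientError (same `response` shape)."""

    def __init__(self, code):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


def attempt_cleanup_with_retry(delete, timeout_seconds=300, interval_seconds=10,
                               sleep=time.sleep, clock=time.monotonic):
    """Call `delete()` until it succeeds, fails for good, or the window closes.

    Returns True if the deletion request was accepted, False otherwise.
    Failures are logged rather than raised, so the caller can re-raise the
    original exception unchanged.
    """
    deadline = clock() + timeout_seconds
    while True:
        try:
            delete()
            return True
        except FakeClientError as err:
            code = err.response["Error"]["Code"]
            if code not in RETRYABLE_CODES:
                log.warning("Cleanup failed with non-retryable error %s", code)
                return False
            if clock() >= deadline:
                log.warning("Cleanup gave up after %s seconds", timeout_seconds)
                return False
            sleep(interval_seconds)
        except Exception:
            # Any other failure is logged and swallowed (best-effort cleanup).
            log.exception("Unexpected error during cleanup")
            return False
```

   Passing `sleep` and `clock` as arguments is only a convenience for testing the timeout path without waiting; a real operator would more likely call `time.sleep` directly.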
   
   **Rationale**
   
   Redshift cluster creation is asynchronous. Failures can occur after creation 
has started but before the operator is able to poll cluster status (for example 
due to missing `redshift:DescribeClusters` permissions). In these scenarios the 
cluster continues provisioning even though the task fails.
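
   The failure window described above can be illustrated with a small simulation. `FakeRedshift` and `create_and_wait` are hypothetical names invented for this sketch; they only mimic the sequence "creation accepted, status check denied" and deliberately use a single deletion attempt with no retries.

```python
import logging

log = logging.getLogger(__name__)


class FakeRedshift:
    """Hypothetical in-memory stand-in for a Redshift client."""

    def __init__(self):
        self.clusters = set()

    def create_cluster(self, cluster_id):
        # The request is accepted at once; provisioning continues in the background.
        self.clusters.add(cluster_id)

    def describe_clusters(self, cluster_id):
        # Simulates an execution identity lacking redshift:DescribeClusters.
        raise PermissionError("not authorized: redshift:DescribeClusters")

    def delete_cluster(self, cluster_id):
        self.clusters.discard(cluster_id)


def create_and_wait(client, cluster_id, cleanup=True):
    """Start creation, then run a post-creation step that may fail."""
    client.create_cluster(cluster_id)
    try:
        client.describe_clusters(cluster_id)  # post-creation step
    except Exception:
        if cleanup:
            try:
                client.delete_cluster(cluster_id)
            except Exception:
                # Cleanup problems are logged, never raised over the original error.
                log.exception("Cleanup of %s failed", cluster_id)
        raise  # re-raise the original post-creation failure


# Without cleanup the task fails but the cluster is left behind.
orphaning = FakeRedshift()
try:
    create_and_wait(orphaning, "demo", cleanup=False)
except PermissionError:
    pass
print(orphaning.clusters)
```

   With `cleanup=True` the same `PermissionError` still reaches the caller, but the fake cluster set ends up empty.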
   
   The existing cleanup logic attempted deletion only once, which frequently 
failed because Redshift rejects deletion while a cluster is still being created. 
This change introduces retries keyed to the `InvalidClusterState` and 
`InvalidClusterStateFault` error codes, within a bounded timeout, so deletion 
can succeed once the cluster reaches a deletable lifecycle state.
   
   The default timeout of 300 seconds was chosen based on several test runs 
during which clusters reached a deletable state in less than that time. This 
approach mirrors the bounded retry cleanup pattern introduced for GKE cluster 
creation in PR #62302.
   
   **Tests**
   
   Unit test changes:
   
   * Existing cleanup tests were updated to account for retry-based cleanup 
behavior.
   * New tests verify that deletion is retried when `InvalidClusterState` or 
`InvalidClusterStateFault` indicates that the cluster still has an active 
operation.
   
   **Documentation**
   
   The docstring for `RedshiftCreateClusterOperator` has been updated to 
document the new `cleanup_timeout_seconds` parameter and its default behavior.
   
   **Backwards Compatibility**
   
   A new optional parameter `cleanup_timeout_seconds` (defaulting to `300` 
seconds) has been added to `RedshiftCreateClusterOperator` to control the 
bounded cleanup retry window.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
