SameerMesiah97 opened a new pull request, #63074: URL: https://github.com/apache/airflow/pull/63074
**Description**

This change adds bounded best-effort cleanup retries to Redshift cluster creation to prevent clusters from being orphaned when failures occur after creation has been initiated. The cleanup behavior is implemented via a helper method called `_attempt_cleanup_with_retry`.

Previously, `create_cluster` could successfully start provisioning, but the operator could fail during post-creation steps (for example if the execution identity lacked `redshift:DescribeClusters`). The existing implementation attempted cleanup but issued only a single deletion request. Because Redshift does not allow deletion while a cluster is still being created, this cleanup path often failed immediately.

With this change, when a failure occurs after cluster creation has been initiated, the operator retries deletion when the API returns `InvalidClusterState` or `InvalidClusterStateFault`, indicating that another operation is still in progress. Cleanup retries occur within a bounded window controlled by `cleanup_timeout_seconds` (default: 300 seconds). Cleanup failures are logged and do not mask the original exception.

**Rationale**

Redshift cluster creation is asynchronous. Failures can occur after creation has started but before the operator is able to poll cluster status (for example due to missing `redshift:DescribeClusters` permissions). In these scenarios the cluster continues provisioning even though the task fails.

The existing cleanup logic attempted deletion only once, which frequently fails because Redshift rejects deletion while a cluster is still being created. This change introduces semantic retries with a bounded timeout so deletion can succeed once the cluster reaches a deletable lifecycle state. The default timeout of 300 seconds was chosen based on several test runs during which clusters reached a deletable state in less than that time.

This approach mirrors the bounded retry cleanup pattern introduced for GKE cluster creation in PR #62302.
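The bounded retry pattern described above can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the function name `attempt_cleanup_with_retry`, the `interval_seconds`, `sleep`, and `clock` parameters, and the `FakeClientError` class are all assumptions made for the example. `FakeClientError` only imitates the `response["Error"]["Code"]` shape that botocore's `ClientError` exposes, so the snippet runs without AWS dependencies.

```python
import logging
import time

log = logging.getLogger(__name__)

# Error codes the PR description treats as "another operation is in progress".
RETRYABLE_CODES = {"InvalidClusterState", "InvalidClusterStateFault"}


class FakeClientError(Exception):
    """Hypothetical stand-in mimicking botocore ClientError's response shape."""

    def __init__(self, code):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


def attempt_cleanup_with_retry(delete_fn, timeout_seconds=300, interval_seconds=15,
                               sleep=time.sleep, clock=time.monotonic):
    """Call delete_fn until it succeeds, fails non-retryably, or time runs out.

    Returns True if the delete call was accepted, False otherwise. It never
    raises, so the caller can re-raise the original exception unmasked.
    """
    deadline = clock() + timeout_seconds
    while True:
        try:
            delete_fn()
            return True
        except Exception as err:
            response = getattr(err, "response", None)
            error = response.get("Error", {}) if isinstance(response, dict) else {}
            code = error.get("Code")
            if code not in RETRYABLE_CODES:
                log.warning("Cleanup failed with non-retryable error: %s", err)
                return False
            if clock() >= deadline:
                log.warning("Cleanup timed out after %s seconds", timeout_seconds)
                return False
        # Cluster is still busy; wait before issuing another delete request.
        sleep(interval_seconds)


if __name__ == "__main__":
    attempts = []

    def flaky_delete():
        # Rejects the first two delete requests, then accepts the third.
        attempts.append(1)
        if len(attempts) < 3:
            raise FakeClientError("InvalidClusterState")

    print(attempt_cleanup_with_retry(flaky_delete, interval_seconds=0))
```

Returning a boolean instead of raising is one way to keep cleanup failures from masking the original exception: the caller logs the outcome and then re-raises whatever caused the task to fail.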
**Tests**

* Updated the existing cleanup tests to account for retry-based cleanup behavior.
* Added unit tests that verify deletion is retried when `InvalidClusterState` or `InvalidClusterStateFault` indicates that the cluster still has an active operation.

**Documentation**

The docstring for `RedshiftCreateClusterOperator` has been updated to document the new `cleanup_timeout_seconds` parameter and its default behavior.

**Backwards Compatibility**

A new optional parameter `cleanup_timeout_seconds` (defaulting to `300` seconds) has been added to `RedshiftCreateClusterOperator` to control the bounded cleanup retry window.
