shivannakarthik opened a new issue, #59812:
URL: https://github.com/apache/airflow/issues/59812

   ### Apache Airflow Provider(s)
   
   google
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Apache Airflow version
   
   main
   
   ### Operating System
   
   ubuntu
   
   ### Deployment
   
   Astronomer
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   When implementing the Ephemeral Dataproc Cluster pattern:
   `Create Cluster` -> `Run Jobs` -> `Delete Cluster (TriggerRule.ALL_DONE)`
   
   There is a conflict between the default behavior of 
`DataprocCreateClusterOperator` and the downstream 
`DataprocDeleteClusterOperator`.
   
   1. `DataprocCreateClusterOperator` has `delete_on_error=True` by default. If 
the cluster creation fails and ends up in an `ERROR` state, the operator 
automatically deletes the cluster.
   2. The downstream `DataprocDeleteClusterOperator` triggers (due to 
`TriggerRule.ALL_DONE`).
   3. It attempts to delete the cluster which no longer exists.
   4. The `DataprocDeleteClusterOperator` fails with a `NotFound` (404) error 
from the Google Cloud API.
   
   This causes the cleanup task to be marked as `failed`, which creates noise 
and can potentially mask the actual upstream failure in monitoring views.
   
   ### What you think should happen instead
   
   `DataprocDeleteClusterOperator` should ideally be idempotent. If the cluster 
is already deleted (returns 404 NotFound), the operator should consider the 
task successful (or skipped) rather than failed.
   
   Currently, the `deferrable` mode implementation checks for existence:
   ```python
               try:
                   hook.get_cluster(...)
               except NotFound:
                   self.log.info("Cluster deleted.")
                   return
   ```
   However, the standard synchronous `execute` path does not seem to catch 
`NotFound` exceptions during the delete operation.
   
   ### How to reproduce
   
   1. Create a DAG with `DataprocCreateClusterOperator` -> 
`DataprocDeleteClusterOperator` (with `trigger_rule=TriggerRule.ALL_DONE`).
   2. Force the cluster creation to enter an ERROR state (e.g., by providing 
invalid configuration that passes validation but fails provisioning).
   3. `DataprocCreateClusterOperator` will delete the cluster and fail.
   4. `DataprocDeleteClusterOperator` will run, attempt to delete the missing 
cluster, and fail with `NotFound`.
   
   
   ### Anything else
   
   
   Proposed behaviour:
   1. Update `DataprocDeleteClusterOperator` to catch `NotFound` exceptions 
during the delete operation and log a message instead of raising an error.
   2. Alternatively, update documentation to explicitly recommend setting 
`delete_on_error=False` in `DataprocCreateClusterOperator` when an explicit 
delete task is used.
   
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to