radhwene commented on issue #68040: URL: https://github.com/apache/airflow/issues/68040#issuecomment-4674903623
Thanks @anmolxlight and @Vamsi-klu for picking this up — both #68146 and #68151 look like careful pieces of work. After running the sensor design end-to-end against a real Cloud SQL PostgreSQL instance, I want to share evidence that a **sensor-only design cannot fully solve #68040**. This is not about the implementation quality of either PR; it is a property of where the check sits in the DAG. ## The problem, precisely Cloud SQL serializes administrative operations **per instance** on the server side. The `409 operationInProgress` is raised by the operator submit call itself — `import_instance` or `export_instance`. Any design that puts the guard in a separate upstream task leaves a gap between: 1. the sensor checking that the instance is idle; 2. the downstream operator being scheduled; 3. the operator submitting the actual Cloud SQL Admin API request. Nothing holds a lock across that gap. ## Reproduction The minimal trigger is two parallel Cloud SQL CSV imports against the same instance: ```python create_tables >> [import_complaints, import_crime] ``` On the stock provider, one import succeeds and the other can fail with: ```text HTTP 409 / reason: operationInProgress ``` ## Why the sensor does not close the race I tested the sensor approach with `CloudSQLNoOperationInProgressSensor` in `reschedule` mode before each import: ```python create_tables >> wait_a >> import_complaints create_tables >> wait_b >> import_crime ``` In a live run, `wait_a` succeeds, and the immediately downstream `import_complaints` task still fails with `409 operationInProgress`.  The race is structural: ```text T0 sensor: operations.list -> "no op running" -> SUCCESS | | gap = task scheduling + executor latency v T1 operator: submit -> another operation started in [T0,T1] -> 409 ``` Three points matter here: 1. **No lock is held across the gap.** The check at `T0` does not guarantee the instance state at `T1`. 2. **The Cloud SQL constraint is global, while the sensor is local to a DAG path.** Another DAG, scheduler, user, maintenance operation, or external process can start an operation after the sensor succeeds. 3. **It requires per-DAG wiring.** Existing DAGs would need to be changed, and new DAGs would need to remember to add the sensor before each relevant admin operation. A sensor can reduce the probability of the race. It cannot remove it. Durable protection has to exist at the submit call. ## Proposed fix `CloudSQLHook` already uses `GoogleBaseHook.operation_in_progress_retry()` for several Cloud SQL admin methods. `import_instance` and `export_instance` appear to be the missing cases. The fix I propose is to extend the existing hook-level retry policy to: * `CloudSQLHook.import_instance` * `CloudSQLHook.export_instance` This retries the transient `operationInProgress` response at the submit point, with no DAG changes required. There is also an exception-handling detail in `import_instance`: it currently wraps `HttpError` into `AirflowException`, which can prevent the retry decorator from observing the original retryable `HttpError`. So the fix needs to preserve retryable operation-in-progress `HttpError`s for the decorator, while keeping the existing `AirflowException` behavior for terminal errors. ## Validation I validated the hook-level fix with a higher-contention topology: two imports and two exports submitted in parallel against the same Cloud SQL instance, with no sensor. All four tasks completed successfully after submit retries, with no `409 operationInProgress` surfaced as a final task failure.  ## Proposal I will open a small PR that: * adds `operation_in_progress_retry()` to `import_instance` and `export_instance`; * fixes the `import_instance` exception wrapping so retryable `HttpError`s remain visible to the retry decorator; * adds unit tests for the retry behavior and exception propagation. The sensor can still be useful as an optional visibility or pre-wait primitive. I just do not think it should be considered the correctness fix for #68040 unless it is documented as best-effort, with durable `409 operationInProgress` protection handled in the hook. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
