radhwene commented on issue #68040:
URL: https://github.com/apache/airflow/issues/68040#issuecomment-4674903623

   Thanks @anmolxlight and @Vamsi-klu for picking this up — both #68146 and 
#68151 look like careful pieces of work.
   
   After running the sensor design end-to-end against a real Cloud SQL 
PostgreSQL instance, I want to share evidence that a **sensor-only design 
cannot fully solve #68040**. This is not about the implementation quality of 
either PR; it is a property of where the check sits in the DAG.
   
   ## The problem, precisely
   
   Cloud SQL serializes administrative operations **per instance** on the 
server side. The `409 operationInProgress` is raised by the operator submit 
call itself — `import_instance` or `export_instance`.
   
   Any design that puts the guard in a separate upstream task leaves a gap 
between:
   
   1. the sensor checking that the instance is idle;
   2. the downstream operator being scheduled;
   3. the operator submitting the actual Cloud SQL Admin API request.
   
   Nothing holds a lock across that gap.
   
   ## Reproduction
   
   The minimal trigger is two parallel Cloud SQL CSV imports against the same 
instance:
   
   ```python
   create_tables >> [import_complaints, import_crime]
   ```
   
   On the stock provider, one import succeeds and the other can fail with:
   
   ```text
   HTTP 409 / reason: operationInProgress
   ```
   
   ## Why the sensor does not close the race
   
   I tested the sensor approach with `CloudSQLNoOperationInProgressSensor` in 
`reschedule` mode before each import:
   
   ```python
   create_tables >> wait_a >> import_complaints
   create_tables >> wait_b >> import_crime
   ```
   
   In a live run, `wait_a` succeeds, and the immediately downstream 
`import_complaints` task still fails with `409 operationInProgress`.
   
   ![Sensor passes, import still fails with 
409](https://raw.githubusercontent.com/radhwene/airflow-cloudsql-409-artifacts/main/sensoror_solution.png)
   
   The race is structural:
   
   ```text
   T0  sensor: operations.list -> "no op running" -> SUCCESS
           |
           | gap = task scheduling + executor latency
           v
   T1  operator: submit -> another operation started in [T0,T1] -> 409
   ```
   
   Three points matter here:
   
   1. **No lock is held across the gap.** The check at `T0` does not guarantee 
the instance state at `T1`.
   2. **The Cloud SQL constraint is global, while the sensor is local to a DAG 
path.** Another DAG, scheduler, user, maintenance operation, or external 
process can start an operation after the sensor succeeds.
   3. **It requires per-DAG wiring.** Existing DAGs would need to be changed, 
and new DAGs would need to remember to add the sensor before each relevant 
admin operation.
   
   A sensor can reduce the probability of the race. It cannot remove it. 
Durable protection has to exist at the submit call.
   
   ## Proposed fix
   
   `CloudSQLHook` already uses `GoogleBaseHook.operation_in_progress_retry()` 
for several Cloud SQL admin methods. `import_instance` and `export_instance` 
appear to be the missing cases.
   
   The fix I propose is to extend the existing hook-level retry policy to:
   
   * `CloudSQLHook.import_instance`
   * `CloudSQLHook.export_instance`
   
   This retries the transient `operationInProgress` response at the submit 
point, with no DAG changes required.
   
   There is also an exception-handling detail in `import_instance`: it 
currently wraps `HttpError` into `AirflowException`, which can prevent the 
retry decorator from observing the original retryable `HttpError`. So the fix 
needs to preserve retryable operation-in-progress `HttpError`s for the 
decorator, while keeping the existing `AirflowException` behavior for terminal 
errors.
   
   ## Validation
   
   I validated the hook-level fix with a higher-contention topology: two 
imports and two exports submitted in parallel against the same Cloud SQL 
instance, with no sensor.
   
   All four tasks completed successfully after submit retries, with no `409 
operationInProgress` surfaced as a final task failure.
   
   ![No sensor, four parallel admin ops all 
succeed](https://raw.githubusercontent.com/radhwene/airflow-cloudsql-409-artifacts/main/new_version_of_operator.png)
   
   ## Proposal
   
   I will open a small PR that:
   
   * adds `operation_in_progress_retry()` to `import_instance` and 
`export_instance`;
   * fixes the `import_instance` exception wrapping so retryable `HttpError`s 
remain visible to the retry decorator;
   * adds unit tests for the retry behavior and exception propagation.
   
   The sensor can still be useful as an optional visibility or pre-wait 
primitive. I just do not think it should be considered the correctness fix for 
#68040 unless it is documented as best-effort, with durable `409 
operationInProgress` protection handled in the hook.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to