Vamsi-klu commented on PR #63581:
URL: https://github.com/apache/airflow/pull/63581#issuecomment-4061361400

   Hey @YoannAbriel, nice writeup on this one — the race condition between 
concurrent RTIF writes hitting the unique constraint is a real problem, and 
catching it at the endpoint level with a retry is a pragmatic approach.
   
   A few things I noticed while reading through:
   
   **The None guard after re-fetch could be tighter.** After 
`session.rollback()` and re-fetching `task_instance`, if the TI was deleted 
between the rollback and the re-fetch, `task_instance` would be None. Right now 
the code logs the success message and returns 201 regardless. Might be worth 
either raising a 404 there (consistent with the check above the try block) or 
at least logging a warning so it doesn't silently look like a successful write.
   
   **The `except IntegrityError` is a bit broad.** It catches all integrity 
errors, not just the unique constraint on `rendered_task_instance_fields_pkey`. 
For example, a foreign key violation (if the TI was deleted mid-flight) would 
also be caught and retried, which would just fail again. In practice this 
probably doesn't cause harm, but checking the constraint name in the error 
before retrying would make it more precise.
   
   **On the tests** — `test_ti_put_rtif_integrity_error_handled` directly 
validates the retry logic with the mock, which is good. One gap: since the mock 
replaces the entire `update_rtif` method, it doesn't verify that the session is 
actually in a usable state after the rollback+re-fetch. A test that checks the 
RTIF row actually exists in the DB after the retry would strengthen confidence.
   
   **Longer term thought** — have you considered using `INSERT...ON CONFLICT DO 
UPDATE` (upsert) at the SQL level in `RenderedTaskInstanceFields.write()`? 
SQLAlchemy supports it for Postgres, MySQL, and SQLite. That would eliminate 
the race at the root instead of handling it at the endpoint. Definitely a 
bigger change though, so the retry approach makes sense as a targeted fix for 
now.
   
   Good catch overall — RTIF writes being killed by a 409 that the task-sdk 
doesn't retry is a subtle failure mode.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to