Vamsi-klu commented on PR #63581: URL: https://github.com/apache/airflow/pull/63581#issuecomment-4061361400
Hey @YoannAbriel, nice writeup on this one — the race condition between concurrent RTIF writes hitting the unique constraint is a real problem, and catching it at the endpoint level with a retry is a pragmatic approach. A few things I noticed while reading through: **The None guard after re-fetch could be tighter.** After `session.rollback()` and re-fetching `task_instance`, if the TI was deleted between the rollback and the re-fetch, `task_instance` would be None. Right now the code logs the success message and returns 201 regardless. Might be worth either raising a 404 there (consistent with the check above the try block) or at least logging a warning so it doesn't silently look like a successful write. **The `except IntegrityError` is a bit broad.** It catches all integrity errors, not just the unique constraint on `rendered_task_instance_fields_pkey`. For example, a foreign key violation (if the TI was deleted mid-flight) would also be caught and retried, which would just fail again. In practice this probably doesn't cause harm, but checking the constraint name in the error before retrying would make it more precise. **On the tests** — `test_ti_put_rtif_integrity_error_handled` directly validates the retry logic with the mock, which is good. One gap: since the mock replaces the entire `update_rtif` method, it doesn't verify that the session is actually in a usable state after the rollback+re-fetch. A test that checks the RTIF row actually exists in the DB after the retry would strengthen confidence. **Longer term thought** — have you considered using `INSERT...ON CONFLICT DO UPDATE` (upsert) at the SQL level in `RenderedTaskInstanceFields.write()`? SQLAlchemy supports it for Postgres, MySQL, and SQLite. That would eliminate the race at the root instead of handling it at the endpoint. Definitely a bigger change though, so the retry approach makes sense as a targeted fix for now. Good catch overall — RTIF writes being killed by a 409 that the task-sdk doesn't retry is a subtle failure mode. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
