moomindani commented on issue #52280:
URL: https://github.com/apache/airflow/issues/52280#issuecomment-4571822559
Thanks for the ping @eladkal — looking at this from the Databricks side.
A couple of updates worth folding in before settling on direction:
**1. The "wait for Airflow 3.1" rationale in the issue is largely obsolete**
Timeline-wise, the issue was filed 2025-06 when 3.0 had just shipped. As of
today, 3.1 (released 2025-09) and 3.2 (2026-04) are both out, and the relevant
plugin extensibility is in place:
- `fastapi_apps` has been available since 3.0 (see
`airflow-core/src/airflow/plugins_manager.py:234`), with working examples in
`providers/edge3/.../edge_executor_plugin.py`. So the HTTP endpoint piece
doesn't need to wait.
- `react_apps` (3.1+) is what enables a richer "pick which tasks to repair"
UI, if we want it.
- Auth dependencies are in place: `GetUserDep` and
`permitted_dag_filter_factory` in
`airflow-core/src/airflow/api_fastapi/core_api/security.py` cover the
equivalent of the current FAB view's `auth.has_access_dag("POST",
DagAccessEntity.RUN)`.
- The "no direct DB access" rule applies to workers, triggerers, and the DAG
file processor (DFP), not the API server, so the FastAPI handler can talk to
the metadata DB normally.
Of the 4 goals listed in the issue, goals 2 (FastAPI endpoint), 3
(auth/authz), and 4 (defer to 3.1) are essentially resolved by what Airflow has
shipped since. Only goal 1 (timing mismatch) is still an open design question.
**2. The timing mismatch — three shapes worth comparing**
The pre-computed-at-execution-time vs. needs-runtime-failed-state problem is
the real architectural challenge. Three shapes I'd consider:
- **A. Server-side resolution (minimal)**: XCom stores a static endpoint URL
`/databricks/repair/{dag_id}/{run_id}`. The FastAPI handler resolves failed
Databricks tasks server-side at click time, calls `repair_run`, and clears
matching Airflow TIs. Same UX as today, implementable on 3.0.
- **B. Two-step UX (richer)**: XCom stores an external view URL; the view
fetches failed tasks via FastAPI and lets the user pick which to repair. Closer
to the Databricks-native repair flow but really wants `react_apps` to look
right.
- **C. Operator-level repair**: A `DatabricksRepairFailedOperator` added
downstream with `trigger_rule="one_failed"` that auto-detects failed Databricks
tasks and repairs them. Sidesteps UI plugins entirely; can co-exist with A or B.
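
To make shape A concrete, here is a rough sketch of the two pure pieces it needs: the static path written to XCom at execution time, and the click-time resolution of what to repair and clear. Everything beyond the `/databricks/repair/{dag_id}/{run_id}` shape is an assumption for illustration: the function names, the set of result states treated as repairable, and the `task_key_to_airflow_id` mapping are hypothetical, and the input is assumed to look like the `tasks` list of a Databricks run response.

```python
from urllib.parse import quote

# Assumption: result states treated as repairable; the real set is a design choice.
REPAIRABLE_STATES = frozenset({"FAILED", "TIMEDOUT", "CANCELED"})


def build_repair_path(dag_id: str, run_id: str) -> str:
    """Static path stored in XCom at execution time; it carries no failure state.

    run_id values often contain ':' and '+', so both segments are percent-encoded.
    """
    return f"/databricks/repair/{quote(dag_id, safe='')}/{quote(run_id, safe='')}"


def plan_repair(databricks_tasks: list, task_key_to_airflow_id: dict) -> tuple:
    """Resolve, at click time, which Databricks tasks to repair and which TIs to clear.

    databricks_tasks: dicts with "task_key" and a nested "state" -> "result_state".
    task_key_to_airflow_id: hypothetical mapping from Databricks task key to Airflow task_id.
    """
    failed_keys = [
        task["task_key"]
        for task in databricks_tasks
        if task.get("state", {}).get("result_state") in REPAIRABLE_STATES
    ]
    # Only clear Airflow task instances that correspond to a failed Databricks task.
    to_clear = [task_key_to_airflow_id[k] for k in failed_keys if k in task_key_to_airflow_id]
    return failed_keys, to_clear
```

The point of the split is that nothing about failure state has to be known when the link is written; the handler behind the path can do all of the resolution when the user clicks.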
My read: A restores parity for Airflow 3 users with the smallest surface, B
becomes more attractive once a project commits to a React plugin UI, and C is
independently useful for the "DAG-native repair" crowd regardless of which UI
path lands.
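
For shape C, the core of the operator's `execute` would be "look at the run, pick the failed tasks, build a repair request". A minimal sketch of that step follows, assuming a run dict shaped like a Databricks run response and request field names (`run_id`, `rerun_tasks`, `latest_repair_id`) as I understand the Jobs repair API; the function name and the choice of failed states are hypothetical and should be checked against the provider before use.

```python
from typing import Optional

# Assumption: which Databricks result states count as "failed" for repair purposes.
FAILED_STATES = frozenset({"FAILED", "TIMEDOUT", "CANCELED"})


def build_repair_payload(run: dict, latest_repair_id: Optional[int] = None) -> Optional[dict]:
    """Build a repair request body from a run description.

    Returns None when no task is in a failed state, so the caller can skip the repair call.
    """
    failed = [
        task["task_key"]
        for task in run.get("tasks", [])
        if task.get("state", {}).get("result_state") in FAILED_STATES
    ]
    if not failed:
        return None
    payload = {"run_id": run["run_id"], "rerun_tasks": failed}
    if latest_repair_id is not None:
        # Only included when the caller knows of an earlier repair of this run.
        payload["latest_repair_id"] = latest_repair_id
    return payload
```

In the operator, a non-`None` payload would be handed to `repair_run`; returning `None` lets a `one_failed`-triggered task exit cleanly when the failure came from a non-Databricks task.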