Re: [PR] Databricks workflow automatic repair airflow3 [airflow]

via GitHub Thu, 18 Jun 2026 16:37:57 -0700


moomindani commented on PR #68358:
URL: https://github.com/apache/airflow/pull/68358#issuecomment-4747002622


   This is a great write-up — the cluster-lifecycle split (native retries = 
in-flight, same cluster; repair = terminal run, fresh cluster + 
failed/dependent tasks) is exactly the right mental model, and it matches the 
failures you described.
   
   I'd go with splitting: land native task retries (`max_retries` / 
`min_retry_interval_millis` on the task-group task spec) as its own PR first, 
and keep this repair PR as the scoped follow-up for what retries can't reach 
(fresh-cluster recovery on a degraded driver/node).
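   
   For illustration, a minimal sketch of what the retries-only change could look like on a task spec. Only `max_retries` and `min_retry_interval_millis` come from this thread; the helper name `with_native_retries`, the default values, and the example task are assumptions, not the provider's actual code.

```python
def with_native_retries(task_spec, max_retries=2, min_retry_interval_millis=60_000):
    """Return a copy of a Databricks task spec with Databricks-side retry settings.

    Hypothetical helper: the defaults are illustrative, not recommended values.
    """
    updated = dict(task_spec)  # shallow copy so the caller's spec is not mutated
    updated["max_retries"] = max_retries
    updated["min_retry_interval_millis"] = min_retry_interval_millis
    return updated


# Assumed example task; the task key and notebook path are placeholders.
base_task = {
    "task_key": "ingest",
    "notebook_task": {"notebook_path": "/Shared/ingest"},
}
task = with_native_retries(base_task)
```

   The retry then happens inside the running job, on the same cluster, which is the in-flight half of the split described above.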
   
   Reasoning:
   - The native-retries change is smaller and lower-risk, and it covers the 
majority of the real cases you hit (transient source outage, flaky install, 
unresponsive kernel), so it delivers most of the value on its own and is easy 
to review.
   - This repair PR is substantial (coordinator injection, sync/deferrable 
parity, the shared-deadline coordination). Landing it after retries exist lets 
it be scoped precisely to the fresh-cluster case and reviewed on its own 
merits, instead of carrying the "why not just retries?" question.
   
   Two things worth folding into the native-retries PR while you're there:
   - **Document the Airflow `retries` interaction.** As you noted, Airflow 
task-level `retries` in the task group just re-run the monitor (no-op against 
an already-terminal sub-run), so users should reach for the Databricks-side 
`max_retries` instead. Calling that out will save people the exact confusion 
you described.
   - **A line on the retries-vs-repair boundary** — which failure classes 
retries cover vs. which actually need a fresh cluster — so the follow-up's 
scope is clear up front.
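   
   One way to state that boundary, as a rough sketch using only the failure classes named in this thread. The enum, the dictionary keys, and the function name are hypothetical labels for documentation purposes, not identifiers from the provider.

```python
from enum import Enum


class Recovery(Enum):
    NATIVE_RETRY = "native_retry"  # in-flight retry on the same cluster
    REPAIR = "repair"              # terminal run, fresh cluster


# Failure classes as described in this thread; the key names are made up here.
RECOVERY_BY_FAILURE = {
    "transient_source_outage": Recovery.NATIVE_RETRY,
    "flaky_install": Recovery.NATIVE_RETRY,
    "unresponsive_kernel": Recovery.NATIVE_RETRY,
    "degraded_driver_or_node": Recovery.REPAIR,
}


def recovery_for(failure_class):
    """Look up which mechanism is expected to handle a failure class."""
    return RECOVERY_BY_FAILURE[failure_class]
```

   Airflow task-level `retries` is deliberately absent from the mapping, since it only re-runs the monitor against an already-terminal sub-run.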
   
   That's my read as a reviewer; @eladkal may have a preference on sequencing 
too.
   
   ---
   Drafted-by: Claude Code (Opus 4.8); reviewed by @moomindani before posting


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
