Beat-Nick commented on PR #68358:
URL: https://github.com/apache/airflow/pull/68358#issuecomment-4744193603

   > I'd like to open a design discussion rather than request changes — mostly 
about how this positions itself relative to the other recovery mechanisms in 
play.
   
   Thanks. Digging into these shifted how I think this should be positioned.
   
   Context on how I got here: the failures driving this were all transient (a 
~5 min upstream-source outage, a library that failed to install, a Python 
kernel going unresponsive), and a job repair cleared each one. So that's what I 
reached for and what this PR builds. With fresh eyes, though, native task 
retries may be the better tool here, and the task group exposes neither retries 
nor repair today. 
   
   **Repair vs. native retries.** Complementary, split by cluster lifecycle. 
Native retries (`max_retries` / `min_retry_interval_millis`) re-run the failed 
task in-flight on the same cluster; `repair_run` acts on a terminal run, so it 
gets a fresh cluster and can re-run failed and dependent tasks. For the three 
failures above, retries are the better primary tool. So I'd put retries first, 
with `workflow_repair_attempts` as the run-level backstop for what retries 
can't reach (fresh-cluster recovery on a degraded driver/node).
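
To make the split concrete, here is a rough sketch of the two payload shapes as I understand them. The helper names (`with_native_retries`, `build_repair_payload`) and the default values are hypothetical and not part of this PR or the provider; the field names are my reading of the Databricks Jobs API task settings and `runs/repair` request, so please check them against the API docs before relying on this.

```python
from typing import Any, Dict, Optional


def with_native_retries(
    task: Dict[str, Any],
    max_retries: int = 2,
    min_retry_interval_millis: int = 60_000,
) -> Dict[str, Any]:
    """Return a copy of a task spec with task-level (native) retry settings.

    Hypothetical helper: retries configured this way are handled by the job
    itself while the run is still in flight.
    """
    patched = dict(task)  # shallow copy so the caller's spec is not mutated
    patched["max_retries"] = max_retries
    patched["min_retry_interval_millis"] = min_retry_interval_millis
    return patched


def build_repair_payload(
    run_id: int, latest_repair_id: Optional[int] = None
) -> Dict[str, Any]:
    """Build a repair request body for a run that already reached a terminal state.

    Hypothetical helper: asks for all failed tasks and their dependents to be
    re-run, which is the run-level backstop case.
    """
    payload: Dict[str, Any] = {
        "run_id": run_id,
        "rerun_all_failed_tasks": True,
        "rerun_dependent_tasks": True,
    }
    # Later repairs of the same run are assumed to need the previous repair id.
    if latest_repair_id is not None:
        payload["latest_repair_id"] = latest_repair_id
    return payload


task_spec = with_native_retries({"task_key": "ingest"})
repair_body = build_repair_payload(run_id=123)
```

The first helper touches only the task definition, while the second needs a finished run id, which is why they slot in at different points of the run lifecycle.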
   
   **Interaction with Airflow `retries`.** If `retries` is set at the task 
level, it will keep working as it does today, which is admittedly confusing: on 
a failure, an Airflow retry only re-runs the monitor, which finds nothing to 
repair and fails again.
   
   **Path forward.** I'd like to split native retries into its own PR first and 
keep this one as a potential follow-up for the cases retries can't cover. Sound 
right, or would you rather both land together here with the positioning 
documented up front?
   
   ---
   Drafted-by: Claude Code (Opus 4.8); reviewed by @BeatNick before posting


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to