Hello, I checked the discussion and I don't really see any real use case where that could potentially be needed. Tasks can currently pass data between their executions via XCom or other methods implemented in task logic, but that data should generally not change if the input didn't change (e.g. from upstream tasks), so retrying at the task level should be sufficient.
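To illustrate the point above, here is a minimal sketch (plain Python, deliberately not using Airflow's API) of why task-level retries suffice when a task's inputs are stable: re-invoking just the failed task eventually succeeds without rerunning anything upstream. `run_with_retries` and `flaky_task` are hypothetical names for illustration, loosely mimicking Airflow's per-task `retries` setting.

```python
def run_with_retries(task_fn, retries=3):
    """Re-invoke a single task until it succeeds or retries are exhausted."""
    last_exc = None
    for _attempt in range(1 + retries):
        try:
            return task_fn()
        except Exception as exc:
            last_exc = exc
    raise last_exc

# A flaky task whose input never changes: retrying only this task is enough;
# upstream results (e.g. values already pushed to XCom) stay valid.
calls = {"n": 0}

def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = run_with_retries(flaky_task)
assert result == "ok" and calls["n"] == 3
```

The whole-DAG rerun discussed in this thread would only add value where this assumption breaks, i.e. where a failure invalidates upstream results as well.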
> One user-side story I can picture is ML-style pipelines where a final
> validation or evaluation step fails and teams want a full rerun of the run
> instead of only retrying failed tasks.

Failure within an ML pipeline, IMHO, would only require a retry at the task level, as e.g. the models, after training, should be saved and reused by other tasks. A potential issue I could see (within ML pipelines) would be when the task itself fails and retrying the whole operation is expensive, but that part could be solved after AIP-103. Maybe the only need for retrying everything (without thinking Airflow-specific) would be e.g. some time-series or streaming-related cases where, after a failure somewhere, the whole processing becomes invalid (basically operations that cannot be designed so that only part of them is retried).

> Do you feel this need in practice? / do you see it as something that belongs
> in core?

Not really, at least for now.

> How do you work around it today?

By designing the processes in a way where only task-level retries are needed if a failure occurs.

Regards,
PM

________________________________
From: Yuseok Jo <[email protected]>
Sent: 07 April 2026 15:07
To: [email protected] <[email protected]>
Subject: [DISCUSS] Feedback on DAG-level full-run retries (issue 60866)

Hello community,

I would like to pick up the discussion on GitHub issue 60866 about DAG-level automatic retries, i.e. rerunning a whole DAG run from the start when a terminal task fails or the DAG run ends in a certain state.

https://github.com/apache/airflow/issues/60866

I am not the person who originally opened that issue, and the original author may not be active now. I am unsure whether this is a real gap for users or something we should handle with patterns we already have.

One user-side story I can picture is ML-style pipelines where a final validation or evaluation step fails and teams want a full rerun of the run instead of only retrying failed tasks.
This is just one possible scenario; other domains may have similar needs. I am not proposing a core change yet. I mainly want light feedback on three points: Do you feel this need in practice? How do you work around it today? And do you see it as something that belongs in core?

Thanks,
Yuseok Jo
