Hello, I checked the discussion and I don't really see any real use case where that could potentially be needed. Tasks can currently pass data between their executions via XCom or other methods implemented in task logic, but that data should generally not change if the input didn't change (e.g. from upstream tasks), so retrying at the task level should be sufficient.
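To illustrate the point above, here is a minimal sketch (plain Python, deliberately not using Airflow's API) of why task-level retries suffice when a task's inputs are stable: re-invoking just the failed task eventually succeeds without rerunning anything upstream. `run_with_retries` and `flaky_task` are hypothetical names for illustration, loosely mimicking Airflow's per-task `retries` setting.

```python
def run_with_retries(task_fn, retries=3):
    """Re-invoke a single task until it succeeds or retries are exhausted."""
    last_exc = None
    for _attempt in range(1 + retries):
        try:
            return task_fn()
        except Exception as exc:
            last_exc = exc
    raise last_exc

# A flaky task whose input never changes: retrying only this task is enough;
# upstream results (e.g. values already pushed to XCom) stay valid.
calls = {"n": 0}

def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = run_with_retries(flaky_task)
assert result == "ok" and calls["n"] == 3
```

The whole-DAG rerun discussed in this thread would only add value where this assumption breaks, i.e. where a failure invalidates upstream results as well.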
> One user-side story I can picture is ML-style pipelines where a final
> validation or evaluation step fails and teams want a full rerun of the run
> instead of only retrying failed tasks.

Failure within an ML pipeline, IMHO, would only require a retry at the task level, as e.g. the models, after training, should be saved and reused by other tasks. A potential issue I could see (within ML pipelines) would be when the task itself fails and retrying the whole operation is expensive, but that part could be solved after AIP-103. Maybe the only need for retrying everything (without thinking Airflow-specific) would be e.g. some time-series or streaming-related cases where, after a failure somewhere, the whole processing becomes invalid (basically operations that cannot be designed so that only part of them is retried).

> Do you feel this need in practice? / do you see it as something that belongs
> in core?

Not really, at least for now.

> How do you work around it today?

By designing the processes in a way where only task-level retries are needed if a failure occurs.

Regards,
PM

________________________________
From: Yuseok Jo <[email protected]>
Sent: 07 April 2026 15:07
To: [email protected] <[email protected]>
Subject: [DISCUSS] Feedback on DAG-level full-run retries (issue 60866)

Hello community,

I would like to pick up the discussion on GitHub issue 60866 about DAG-level automatic retries, i.e. rerunning a whole DAG run from the start when a terminal task fails or the DAG run ends in a certain state.

https://github.com/apache/airflow/issues/60866

I am not the person who originally opened that issue, and the original author may not be active now. I am unsure whether this is a real gap for users or something we should handle with patterns we already have.

One user-side story I can picture is ML-style pipelines where a final validation or evaluation step fails and teams want a full rerun of the run instead of only retrying failed tasks.
This is just one possible scenario; other domains may have similar needs. I am not proposing a core change yet. I mainly want light feedback on three points: Do you feel this need in practice? How do you work around it today? And do you see it as something that belongs in core?

Thanks,
Yuseok Jo
