Hi,

Thanks, Jorge, for sharing the context. I went through #21867 and the PoC at
#61809.
I'd like to float a possible direction: what if we tried a smaller v1 scope
that lands TaskGroup-level retries with minimal impact on
existing behavior? The more nuanced options discussed in the thread could
follow as separate PRs once the core is in.
Since the existing issue and PoC are already framed as "TaskGroup
retry", I'd suggest staying with that naming rather than introducing
alternatives (e.g., "transactional task group"), to preserve continuity.

Proposed v1 scope
-----------------
API, mirroring task-level "retries":

    @task_group(retries=3)
    def my_group():
        task_a()
        task_b()

    # context manager form
    with TaskGroup(group_id="my_group", retries=3) as tg:
        task_a()
        task_b()

Behavior:
 - A TaskGroup with "retries=N" (default 0, current behavior) is considered
"failed" once any task within it reaches FAILED after its own task-level
retries are exhausted.
 - On group failure with remaining retries: clear all task instances
within the group via the existing "clear_task_instances" path, increment
a per-DagRun group counter, and let the scheduler pick the cleared TIs back
up (see the sketch after this list).
 - When the counter is exhausted, the group settles as failed and
the DagRun proceeds per the usual leaf-state evaluation.
 - Tasks outside the group are unaffected, so partial application
is preserved.
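
To make the clearing step concrete, here is a rough sketch of how the
group-retry check could look on the scheduler side. It is illustrative only:
the helper names (get_group_retry_count / increment_group_retry_count), the
proposed "retries" attribute on TaskGroup, and where exactly this hook would
live are assumptions to be settled in the PR; only the overall flow (detect a
failed TI in the group, check the counter, reuse the existing clear path) is
the point.

    # Sketch only -- helper names and call site are assumptions, not the PoC's API.
    from airflow.models.taskinstance import clear_task_instances
    from airflow.utils.state import TaskInstanceState

    def maybe_retry_group(dag, dag_run, group, session):
        """Clear the group's TIs if any task failed and group retries remain."""
        # Assumes iterating a TaskGroup yields its tasks (including nested ones).
        group_task_ids = {t.task_id for t in group}
        tis = [
            ti
            for ti in dag_run.get_task_instances(session=session)
            if ti.task_id in group_task_ids
        ]

        # A TI only reaches FAILED after its own task-level retries are
        # exhausted, which matches the "any failed" condition of the v1 scope.
        if not any(ti.state == TaskInstanceState.FAILED for ti in tis):
            return False

        # Hypothetical per-DagRun counter; how it is persisted is an open question.
        if get_group_retry_count(dag_run, group.group_id) >= group.retries:
            return False  # counter exhausted -> the group settles as failed

        increment_group_retry_count(dag_run, group.group_id)
        # Reuse the existing clear path; the scheduler picks the cleared TIs back up.
        clear_task_instances(tis, session=session, dag=dag)
        return True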

Out of v1, deferred to follow-ups:
 - retry_condition (any/last/custom): v1 is "any failed" only.
 - retry_strategy (all_tasks/failed_tasks): v1 is "all_tasks" only.
 - Group-level "retry_delay": task-level "retry_delay" still applies.
 - Cancellation policy for sibling tasks still running: keep the
existing clear behavior (RESTARTING).
 - Nested groups: each group manages its own counter; an outer clear
cascades naturally via the existing clear path (illustrated after this
list). No new concept added.
 - UI affordances beyond what is strictly needed for visibility.
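
For the nested-group point, a hypothetical layout (reusing the placeholder
tasks from the API example above) would simply be:

    with TaskGroup(group_id="outer", retries=1) as outer:
        task_a()
        with TaskGroup(group_id="inner", retries=2) as inner:
            task_b()

Each group would keep its own counter, and clearing "outer" clears "inner"'s
tasks through the normal clear path; the exact interplay of the two counters
is something I'd expect to pin down during review rather than specify up
front.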


The items previously flagged as AIP-worthy (multi-leaf "last",
concurrent task cancellation, retry_strategy choices, etc.) are excluded
from this v1 scope, so my impression is that v1 may not need an AIP.

Does this v1 scope feel like a reasonable starting point?

If the direction sounds reasonable and no major concerns come up over
the next couple of weeks, I'd be happy to take a stab at the implementation.
Edge-case discussions and the deferred follow-up scope can then be picked up
in separate threads/PRs by anyone interested.

Thanks,
Yuseok Jo

On Mon, Apr 27, 2026 at 3:03 AM Jorge Rocamora García <
[email protected]> wrote:

> Hi,
>
> Just to add some context here: there is already an open issue for
> supporting retries at the TaskGroup level:
>
> https://github.com/apache/airflow/issues/21867
>
> There was also an initial PoC PR exploring this:
>
> https://github.com/apache/airflow/pull/61809
>
> I left the PR on hold while waiting for more feedback from the community,
> but I'd be happy to revisit it if there is interest in moving this forward.
>
> Best,
> Jorge
>
> On 2026/04/26 15:14:55 Yuseok Jo wrote:
> > I strongly agree with the principle that tasks should ideally be designed
> > to be idempotent at the task level. The alternatives you suggested look
> > genuinely useful for considering this issue.
> >
> > - Setup/Teardown fits well when the main concern is bracketing a
> > pipeline with preparation/finalization, though it doesn't directly
> address
> > failures in intermediate tasks.
> > - The on_failure_callback approach seems like something that can serve
> > other Airflow users with the same need through documentation alone,
> without
> > any code changes.
> > - QualityCheckOperator aligns better with data-quality validation than
> > with arbitrary task-failure recovery, though the underlying "clear via
> API"
> > building block it relies on is shared with the callback approach.
> > - *TransactionTaskGroup* is an intriguing idea. As I understand it, it
> > would be a TaskGroup with roughly the following behavior:
> > - If any task within the group ultimately fails, the entire group
> > becomes the target for clearing & retrying (following the DAG-level retry
> > policy)
> > - Tasks outside the group are unaffected → partial application is
> > possible
> > - Extending the existing TaskGroup feels like a natural shape
> > - And simply placing all tasks of a DAG into a single such group
> > would produce the same effect as the original request.
> >
> > That said, to be transparent: I was not a strong stakeholder in this
> issue
> > myself. The original reporter went silent and I escalated this to the
> > devlist on their behalf, so I was not in a great position to advocate for
> > the use case's urgency. Apologies also for the slow reply.
> >
> > Given that, here is a reasonable direction:
> >
> > - Short term / immediate value: documenting the on_failure_callback +
> > clear-API pattern as a how-to or example would help other users with the
> > same need right away. Happy to put up a small PR for this.
> > - Longer term: *TransactionTaskGroup* feels like it has value beyond
> > this specific issue. I'd be glad to contribute.
> >
> > Thanks again for the detailed and thoughtful response.
> > It really helped clarify things.
> >
> > On Mon, Apr 20, 2026 at 5:25 AM Jens Scheffler <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > as nobody else was answering on the DISCUSS let me try to break the
> ice.
> > > I was commenting on the PR already.
> > >
> > > I am not a big fan of adding more parameters for the retry as I assume
> a
> > > lot of options are already existing. Yes and mainly on task level.
> > >
> > > My proposal in general would be to model a pipeline in a way that all
> > > tasks are idempotent and not the full pipeline needs to be retried.
> This
> > > is in a matter of cost as well as a matter of time. If you need to run
> > > the full chain then this either smells like the pipeline is badly
> > > modelled as e.g. tasks are not idempotent or it is actually a re-run
> > > with changed parameters (maybe it has been started wrong). A technical
> > > need to re-run all ... might be also a backfill case? So I am not
> seeing
> > > a strong case that would have been missed as a feature in the last 10
> > > years.
> > >
> > > If there actually is (and please convince me of any reason with the
> > > right arguments) then I'd still ask to consider the following
> > > options:
> > >
> > > * Is the workflow actually mainly requiring to make something before
> > > as preparation and maybe something as finalization? Then the
> > > "Startup/Teardown" tasks might be a good composite. Especially if
> > > the pipeline is only 3 tasks then you can use this to ensure all is
> > > re-running
> > > * You could also attempt to fix this without changes in the scheduler
> > > via a on_failure_callback (see
> > >
> > >
> https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/callbacks.html#callback-types
> > > )
> > > and hook a function that clears all tasks via API - and attach this
> > > callback as default to all tasks or to the Dag at the end.
> > > * Instead of extending the Dag and Scheduler logic I would imagine
> > > there might be an option to implement a "QualityCheckOperator" that
> > > takes a condition and in case of not meeting quality criteria then
> > > makes a "Clear DagRun" via API. This would not require additional
> > > Dag parameters and would not need any extensions on the scheduler
> > > but via API could be called from an Operator as alternative.
> > > * I could also imagine that the request raised was naming a Dag but
> > > then a moment later somebody will have the same with a set of tasks
> > > only. So an alternative as well could be having a
> > > "TransactionTaskGroup" which would call all tasks in that task group
> > > being somehow a combined transaction. If one is cleared or one needs
> > > a retry, all together are retried. Then you could apply this to a
> > > subset of tasks or if all tasks are in that group for the full Dag.
> > >
> > > So if the reporter is silent now then we might need to get the original
> > > voice and see if one of the options are already a solution to the
> > > problem. Happy to be convinced.
> > >
> > > Jens
> > >
> > > On 08.04.26 22:12, Przemysław Mirowski wrote:
> > > > Hello,
> > > >
> > > > I checked the discussion and I don't really see any real use case
> where
> > > that could be potentially needed. The tasks currently can send some
> data
> > > between their executions via xcom or some other methods implemented in
> task
> > > logic, but these data should rather not change if the input didn't
> change
> > > (e.g. from upstream tasks), so the retrying on task level should be
> > > sufficient.
> > > >
> > > >> One user-side story I can picture is ML-style pipelines where a
> final
> > > validation or evaluation step fails and teams want a full rerun of the
> run
> > > instead of only retrying failed tasks.
> > > > Failure within the ML pipeline, IMHO would only require the retry on
> > > task level as the e.g. models, after training, should be saved and
> used by
> > > other tasks. Potential issue which I would see (within the ML
> pipelines)
> > > would be when the task itself would fail and retrying whole operation
> is
> > > expensive, but that part could be solved after AIP-103.
> > > >
> > > > Maybe the only need for retrying everything (without thinking
> > > Airflow-specific) would be e.g. some time-series or streaming-related
> cases
> > > where after a failure somewhere, whole processing becomes invalid
> > > (basically the operations where there is no possibility of process
> design
> > > which would allow for only retrying the part of it).
> > > >
> > > >> Do you feel this need in practice?/do you see it as something that
> > > belongs in core?
> > > > Not really, at least for now.
> > > >
> > > >> How do you work around it today?
> > > > Designing the processes in a way where only task-level retries are
> > > needed if failures occur.
> > > >
> > > > Regards,
> > > > PM
> > > >
> > > > ________________________________
> > > > From: Yuseok Jo<[email protected]>
> > > > Sent: 07 April 2026 15:07
> > > > To:[email protected] <[email protected]>
> > > > Subject: [DISCUSS] Feedback on DAG-level full-run retries (issue
> 60866)
> > > >
> > > > Hello community,
> > > >
> > > > I would like to pick up discussion on GitHub issue 60866 about
> DAG-level
> > > > automatic retries or rerunning a whole DAG run from the start when a
> > > > terminal task fails or the DAG run ends in a certain state.
> > > > https://github.com/apache/airflow/issues/60866
> > > >
> > > > I am not the person who originally opened that issue, and the
> original
> > > > author may not be active now. I am unsure whether this is a real gap
> for
> > > > users or something we should handle with patterns we already have.
> > > >
> > > > One user-side story I can picture is ML-style pipelines where a final
> > > > validation or evaluation step fails and teams want a full rerun of
> the
> > > run
> > > > instead of only retrying failed tasks. This is just one possible
> > > scenario.
> > > > Other domains may have similar needs.
> > > >
> > > > I am not proposing a core change yet. I mainly want light feedback on
> > > three
> > > > points.
> > > > Do you feel this need in practice?
> > > > How do you work around it today?
> > > > And do you see it as something that belongs in core?
> > > >
> > > > Thanks,
> > > > Yuseok Jo
> > > >
> >
>
>
