Re: Re: [DISCUSS] Task Group Retries

Vikram Koka via dev Wed, 25 Feb 2026 06:43:06 -0800

Hi Jorge,

I really appreciate your thinking on this and the direction of the proposed
changes.


Task Groups were originally conceived as a UI-only construct.
This was intended to make it easier for users to view their DAGs, but was
not intended to change how they run those DAGs.

I definitely support the change you are proposing here at a conceptual
level.
I am struggling a little with reviewing the PR at the moment, but do intend
to spend more time looking at it.

The one key area of concern I have is the database changes needed,
especially the DB migrations required.
This is purely a caution to consider during implementation and as part of
rollout.

I am looking forward to seeing this evolve.

Vikram


On Wed, Feb 18, 2026 at 7:17 PM Jorge Rocamora García <
[email protected]> wrote:

> Hi all,
>
> I’d like to clarify that several concrete use cases were already described
> in the original issue: https://github.com/apache/airflow/issues/21867
>
> One important aspect is that with the deprecation of SubDAGs in favor of
> TaskGroups, some retry semantics were lost.
>
> In my specific case, I’m using the KubernetesPodOperator, where different
> steps must run in separate pods because they depend on different software.
> However, conceptually, the entire block needs to behave as a single
> logical unit. For example:
>
> - A: Create a PersistentVolumeClaim (PVC) to share data
> - B: Retrieve and prepare inputs
> - C: Run the analysis
> - D: Remove the PVC
>
> This pattern was previously achievable with SubDAGs, but there is currently
> no straightforward mechanism that preserves this grouped execution and
> retry behavior.
>
> Best regards,
> Jorge
>
> On 2026/02/18 22:20:10 Daniel Standish via dev wrote:
> > Yeah I think arguing that there’s a need for it with use cases is a good
> > idea.
> >
> >
> > On Wed, Feb 18, 2026 at 12:02 PM Natanel <[email protected]> wrote:
> >
> > > Hello, I have skimmed over the PR, and overall I have to say that it
> > > looks good.
> > > I have yet to find a use case where I find this feature useful (I just
> > > can't think of one), and I would appreciate it if you could give an
> > > example use case, as quite a few changes have been introduced
> > > (including a new table and new dependency types) for a feature that
> > > allows task groups to be retried.
> > >
> > > I would love to hear what the use case is. I think it might be simpler
> > > to implement with something like a composite task instance, yet I do
> > > not want to propose anything before I hear more about the use case, as
> > > I am most likely just missing something.
> > >
> > > Best regards,
> > > Natanel.
> > >
> > > On Wed, 18 Feb 2026 at 17:49, Jorge Rocamora García <
> > > [email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I’d like to start a discussion around Task Group retries.
> > > >
> > > > Issue: https://github.com/apache/airflow/issues/21867
> > > > PR: https://github.com/apache/airflow/pull/61809
> > > >
> > > > > This PR introduces a proof of concept for TaskGroup retries,
> > > > > allowing a whole TaskGroup to be retried as a unit rather than
> > > > > relying only on individual task retries.
> > > > >
> > > > > In addition to standard retry parameters (retries, retry_delay,
> > > > > exponential backoff, etc.), this proposal introduces
> > > > > TaskGroup-specific retry semantics, including:
> > > >
> > > >
> > > > *
> > > > retry_condition: allows defining when a group should be retried
> (e.g.,
> > > > based on aggregated task states), enabling more flexible policies
> than
> > > > simple failure-based retries.
> > > > *
> > > > retry_fast_fail: enables fail-fast behavior within the group, so that
> > > once
> > > > a retry-triggering condition is met, the group can short-circuit
> > > remaining
> > > > tasks and move directly to retry handling.
> > > >
> > > > > The implementation adds retry configuration to TaskGroup, introduces
> > > > > a task_group_instance model to persist retry state per DagRun, and
> > > > > includes scheduler logic to evaluate retry conditions, enforce
> > > > > delay/backoff, and clear group tasks for subsequent attempts. The
> > > > > feature is opt-in and does not affect existing DAGs unless
> > > > > configured.
> > > >
> > > > I’d appreciate feedback on:
> > > >
> > > >
> > > > *
> > > > The proposed API.
> > > > *
> > > > The scheduler and state-management approach.
> > > > *
> > > > The new model/migration.
> > > > *
> > > > Whether the retry semantics feel intuitive and consistent with
> existing
> > > > task-level retries.
> > > > *
> > > > ..
> > > >
> > > > If there is general agreement on the direction, I’m happy to continue
> > > > refining the implementation.
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > >
> > >
> >
>
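
For readers following the thread, the group-level retry decision described
in the quoted proposal (retries, retry_delay, exponential backoff, and a
retry_condition over aggregated task states) might be sketched roughly as
below. This is a minimal, standalone illustration in plain Python and not
the API from the PR: GroupRetryPolicy, any_failed, next_delay, and the task
names are all hypothetical, and retry_fast_fail is not modeled.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Callable, Dict, Optional

# Hypothetical type: maps task_id -> final state string for one group attempt.
TaskStates = Dict[str, str]
RetryCondition = Callable[[TaskStates], bool]


def any_failed(states: TaskStates) -> bool:
    """Assumed default condition: retry if any task in the group failed."""
    return any(state == "failed" for state in states.values())


@dataclass
class GroupRetryPolicy:
    """Hypothetical container for group-level retry settings."""

    retries: int = 0
    retry_delay: timedelta = timedelta(minutes=5)
    exponential_backoff: bool = False
    retry_condition: RetryCondition = any_failed

    def next_delay(self, try_number: int, states: TaskStates) -> Optional[timedelta]:
        """Return the wait before the next group attempt, or None for no retry.

        try_number is the number of group attempts already completed.
        """
        if try_number > self.retries:
            return None  # retry budget exhausted
        if not self.retry_condition(states):
            return None  # aggregated states do not call for a retry
        # Double the delay on each attempt when backoff is enabled.
        factor = 2 ** (try_number - 1) if self.exponential_backoff else 1
        return self.retry_delay * factor


# The A-D example from the thread, with step C failing.
states = {
    "create_pvc": "success",
    "prepare_inputs": "success",
    "run_analysis": "failed",
    "remove_pvc": "success",
}
policy = GroupRetryPolicy(
    retries=2, retry_delay=timedelta(minutes=1), exponential_backoff=True
)
```

With this policy, next_delay(1, states) asks for a one-minute wait before
the whole A-D block is cleared and rerun, and next_delay(3, states) returns
None because both retries have been used.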
