Hello,

I have skimmed the PR, and overall it looks good. However, I have yet to find a use case where this feature would be useful, so I would appreciate it if you could give an example. The PR introduces quite a few changes (including a new table and new dependency types) for a feature that allows task groups to be retried.
I would love to hear what the use case is. It might be simpler to implement with something like a composite task instance, but I do not want to propose anything before I hear more about the use case, as I am most likely just missing something.

Best regards,
Natanel.

On Wed, 18 Feb 2026 at 17:49, Jorge Rocamora García <[email protected]> wrote:

> Hi all,
>
> I’d like to start a discussion around Task Group retries.
>
> Issue: https://github.com/apache/airflow/issues/21867
> PR: https://github.com/apache/airflow/pull/61809
>
> This PR introduces a proof of concept for TaskGroup retries, allowing a
> whole TaskGroup to be retried as a unit rather than relying only on
> individual task retries.
>
> In addition to standard retry parameters (retries, retry_delay,
> exponential backoff, etc.), this proposal introduces TaskGroup-specific
> retry semantics, including:
>
> * retry_condition: allows defining when a group should be retried (e.g.,
>   based on aggregated task states), enabling more flexible policies than
>   simple failure-based retries.
> * retry_fast_fail: enables fail-fast behavior within the group, so that once
>   a retry-triggering condition is met, the group can short-circuit remaining
>   tasks and move directly to retry handling.
>
> The implementation adds retry configuration to TaskGroup, introduces a
> task_group_instance model to persist retry state per DagRun, and includes
> scheduler logic to evaluate retry conditions, enforce delay/backoff, and
> clear group tasks for subsequent attempts. The feature is opt-in and does
> not affect existing DAGs unless configured.
>
> I’d appreciate feedback on:
>
> * The proposed API.
> * The scheduler and state-management approach.
> * The new model/migration.
> * Whether the retry semantics feel intuitive and consistent with existing
>   task-level retries.
> * ..
>
> If there is general agreement on the direction, I’m happy to continue
> refining the implementation.
>
> Best,
> Jorge
