Hi all, I’d like to start a discussion around Task Group retries.
Issue: https://github.com/apache/airflow/issues/21867
PR: https://github.com/apache/airflow/pull/61809

This PR introduces a proof of concept for TaskGroup retries, allowing a whole TaskGroup to be retried as a unit rather than relying only on individual task retries.

In addition to standard retry parameters (retries, retry_delay, exponential backoff, etc.), this proposal introduces TaskGroup-specific retry semantics, including:

* retry_condition: allows defining when a group should be retried (e.g., based on aggregated task states), enabling more flexible policies than simple failure-based retries.
* retry_fast_fail: enables fail-fast behavior within the group, so that once a retry-triggering condition is met, the group can short-circuit remaining tasks and move directly to retry handling.

The implementation adds retry configuration to TaskGroup, introduces a task_group_instance model to persist retry state per DagRun, and includes scheduler logic to evaluate retry conditions, enforce delay/backoff, and clear group tasks for subsequent attempts. The feature is opt-in and does not affect existing DAGs unless configured.

I’d appreciate feedback on:

* The proposed API.
* The scheduler and state-management approach.
* The new model/migration.
* Whether the retry semantics feel intuitive and consistent with existing task-level retries.
* ..

If there is general agreement on the direction, I’m happy to continue refining the implementation.

Best,
Jorge
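
For illustration only, here is a minimal plain-Python sketch of how the group-level retry decision described above might be evaluated. It is not code from the PR and does not use the Airflow API: the function names (any_failed, should_retry_group, next_retry_delay), the retries_used counter, the max_delay cap, and the state strings are all assumptions made for this example, and the backoff is a simplified doubling rule rather than Airflow's actual formula.

```python
from datetime import timedelta
from typing import Callable, Mapping

# Hypothetical shape: task_id -> final state string for one group attempt.
TaskStates = Mapping[str, str]


def any_failed(states: TaskStates) -> bool:
    """Example retry_condition: retry if at least one task in the group failed."""
    return any(state == "failed" for state in states.values())


def should_retry_group(
    states: TaskStates,
    retries_used: int,
    retries: int,
    retry_condition: Callable[[TaskStates], bool],
) -> bool:
    """Retry only while retries remain and the condition holds for the aggregated states."""
    return retries_used < retries and retry_condition(states)


def next_retry_delay(
    retry_delay: timedelta, retries_used: int, max_delay: timedelta
) -> timedelta:
    """Simplified exponential backoff: double the delay per retry already used, capped."""
    return min(retry_delay * (2 ** retries_used), max_delay)
```

With this sketch, a group configured with retries=2 whose states are {"extract": "success", "load": "failed"} would be retried on its first and second failures and then left failed. A fail-fast variant could call the same condition each time a task finishes instead of waiting for the whole group to complete.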
