Hi all, I’d like to start a discussion around Task Group retries.
Issue: https://github.com/apache/airflow/issues/21867 PR: https://github.com/apache/airflow/pull/61809 This PR introduces a proof of concept for TaskGroup retries, allowing a whole TaskGroup to be retried as a unit rather than relying only on individual task retries. In addition to standard retry parameters (retries, retry_delay, exponential backoff, etc.), this proposal introduces TaskGroup-specific retry semantics, including: - *retry_condition*: allows defining when a group should be retried (e.g., based on aggregated task states), enabling more flexible policies than simple failure-based retries. - *retry_fast_fail*: enables fail-fast behavior within the group, so that once a retry-triggering condition is met, the group can short-circuit remaining tasks and move directly to retry handling. The implementation adds retry configuration to TaskGroup, introduces a task_group_instance model to persist retry state per DagRun, and includes scheduler logic to evaluate retry conditions, enforce delay/backoff, and clear group tasks for subsequent attempts. The feature is opt-in and does not affect existing DAGs unless configured. I’d appreciate feedback on: - The proposed API. - The scheduler and state-management approach. - The new model/migration. - Whether the retry semantics feel intuitive and consistent with existing task-level retries. - .. If there is general agreement on the direction, I’m happy to continue refining the implementation. Best, Jorge
