aeroyorch opened a new pull request, #61809: URL: https://github.com/apache/airflow/pull/61809
# Summary

This PR is a first proof of concept for `TaskGroup` retries (related to: https://github.com/apache/airflow/issues/21867), intended to get an end-to-end implementation in place so we can iterate on behavior and UX in follow-ups.

# Key Changes

- Add `TaskGroup` retry configuration in the SDK (`retries`, `retry_delay`, `retry_exponential_backoff`, `max_retry_delay`, `retry_condition`, `retry_fast_fail`).
- Persist retry state per `TaskGroup` per `DagRun` via new `task_group_instance` model and migration.
- Implement scheduler logic to evaluate `TaskGroup` retry conditions, clear group tasks for another attempt, and enforce retry delay.
- Add a `TaskGroup` retry dependency to block scheduling while a group is waiting for its retry delay.
- Add unit/integration tests for retry behavior, delay/backoff, and fast-fail sibling cancellation.

# Retry Group Options

- `retry_condition`: Controls when a `TaskGroup` is considered failed and should retry.
  Built-ins: `any_failed` (default), `all_failed`.
  Can be a callable for custom logic, receiving task instances and optional context (`task_group`, `task_group_id`, `ti`).
- `retry_fast_fail`: Controls how quickly remaining group tasks are stopped once the retry condition is met.
  - `False` (default): let remaining tasks finish naturally.
  - `True`: running tasks are forced to fail, queued/scheduled tasks are skipped (teardown tasks are respected), enabling faster retry loops.

# Design Notes

I did not add support for restarting only failing tasks (a `retry_strategy` or similar). That behavior is already covered by `TaskInstance` retries, so partial `TaskGroup` retries did not add meaningful value in this initial implementation.

## UI Note

No UI changes were introduced in this PR. A potential follow-up could be adding a display-only state like `"Up for Group Retry"` to indicate tasks waiting on a `TaskGroup` retry delay.
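
To make the `retry_condition` and delay options listed under Retry Group Options easier to discuss, here is a minimal standalone sketch of the semantics as I read them. It is not the code in this PR: it uses plain state strings instead of task instances, the helper names (`group_should_retry`, `next_retry_delay`) are hypothetical, only the `failed` state is treated as a failure, and the backoff is assumed to be plain doubling capped by `max_retry_delay`.

```python
from datetime import timedelta
from typing import Callable, Optional, Sequence, Union

# Assumption: only this state counts as a failure for the group condition.
FAILED = "failed"


def any_failed(states: Sequence[str]) -> bool:
    """Built-in condition: the group failed if at least one task failed."""
    return any(s == FAILED for s in states)


def all_failed(states: Sequence[str]) -> bool:
    """Built-in condition: the group failed only if every task failed."""
    return len(states) > 0 and all(s == FAILED for s in states)


BUILTIN_CONDITIONS = {"any_failed": any_failed, "all_failed": all_failed}
Condition = Union[str, Callable[[Sequence[str]], bool]]


def group_should_retry(
    states: Sequence[str], condition: Condition, attempt: int, retries: int
) -> bool:
    """Return True if the condition holds and retries remain.

    `attempt` is the number of group retries already used (hypothetical name).
    """
    check = BUILTIN_CONDITIONS[condition] if isinstance(condition, str) else condition
    return attempt < retries and check(states)


def next_retry_delay(
    retry_delay: timedelta,
    attempt: int,
    exponential: bool = False,
    max_retry_delay: Optional[timedelta] = None,
) -> timedelta:
    """Delay before the next group attempt.

    Assumption: backoff doubles per used retry; the PR may compute it differently.
    """
    delay = retry_delay * (2 ** attempt) if exponential else retry_delay
    if max_retry_delay is not None:
        delay = min(delay, max_retry_delay)
    return delay


# A custom callable condition: retry only when more than half of the tasks failed.
def majority_failed(states: Sequence[str]) -> bool:
    return sum(s == FAILED for s in states) * 2 > len(states)
```

For example, under these assumptions `group_should_retry(["failed", "success"], "any_failed", attempt=0, retries=2)` is true, while the same states with `"all_failed"` are not, and a 30-second base delay with backoff and a 2-minute cap is clamped to the cap once doubling exceeds it.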

## No Partial (Selective) TaskGroup Retries (for now)

This implementation does not support restarting only the failing tasks within a `TaskGroup`. My reasoning is that selective retries are already covered by `TaskInstance` retries, and introducing partial group retries at this stage would add semantic overlap and scheduler complexity without a clearly distinct use case.

That said, this is not a hard constraint of the design. If there are compelling scenarios where partial `TaskGroup` retries provide meaningful value beyond `TaskInstance` retries, we can revisit the model.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
