aeroyorch opened a new pull request, #61809:
URL: https://github.com/apache/airflow/pull/61809

   # Summary
   
   This PR is a first proof of concept for `TaskGroup` retries (related to: 
https://github.com/apache/airflow/issues/21867), intended to get an end‑to‑end 
implementation in place so we can iterate on behavior and UX in follow‑ups.
   
   # Key Changes
   
   - Add `TaskGroup` retry configuration in the SDK (`retries`, `retry_delay`, 
`retry_exponential_backoff`, `max_retry_delay`, `retry_condition`, 
`retry_fast_fail`).
   - Persist retry state per `TaskGroup` per `DagRun` via a new 
`task_group_instance` model and migration.
   - Implement scheduler logic to evaluate `TaskGroup` retry conditions, clear 
group tasks for another attempt, and enforce retry delay.
   - Add a `TaskGroup` retry dependency to block scheduling while a group is 
waiting for its retry delay.
   - Add unit/integration tests for retry behavior, delay/backoff, and 
fast‑fail sibling cancellation.
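
   To make the delay/backoff options concrete, here is a minimal sketch of how 
a per-attempt delay could be derived from `retry_delay`, 
`retry_exponential_backoff`, and `max_retry_delay`. The helper name 
`group_retry_delay` and the doubling rule are assumptions for illustration; 
the formula used in this PR may differ (for example, it may add jitter).

```python
from __future__ import annotations

from datetime import timedelta


def group_retry_delay(
    attempt: int,
    retry_delay: timedelta,
    retry_exponential_backoff: bool = False,
    max_retry_delay: timedelta | None = None,
) -> timedelta:
    """Return the wait before retry number `attempt` (1-based).

    Hypothetical helper: assumes the base delay doubles on every retry
    when backoff is enabled, then is capped at `max_retry_delay`.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")

    delay = retry_delay
    if retry_exponential_backoff:
        # 1st retry: base, 2nd: 2x base, 3rd: 4x base, ...
        delay = retry_delay * (2 ** (attempt - 1))

    if max_retry_delay is not None:
        # Never wait longer than the configured ceiling.
        delay = min(delay, max_retry_delay)
    return delay
```

   Under these assumptions, a one-minute base delay with backoff enabled waits 
four minutes before the third retry, unless `max_retry_delay` caps it lower.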
   
   # Retry Group Options
   
   - `retry_condition`: Controls when a `TaskGroup` is considered failed and 
should retry. Built‑ins: `any_failed` (default), `all_failed`. Can be a 
callable for custom logic, receiving task instances and optional context 
(`task_group`, `task_group_id`, `ti`).
   - `retry_fast_fail`: Controls how quickly remaining group tasks are stopped 
once the retry condition is met.
       - `False` (default): let remaining tasks finish naturally.
       - `True`: running tasks are forced to fail and queued/scheduled tasks are 
skipped (teardown tasks are respected), which shortens each retry loop.
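
   The two options above can be sketched in plain Python, independent of 
Airflow. Everything here is hypothetical: `FakeTI`, `condition_met`, and 
`apply_fast_fail` are illustrative names, states are plain strings, the custom 
callable receives only the task instances (the optional context is omitted), 
and "teardown tasks are respected" is interpreted as leaving them untouched.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence, Union


@dataclass
class FakeTI:
    """Stand-in for a task instance (not the Airflow model)."""
    task_id: str
    state: str
    is_teardown: bool = False


Condition = Union[str, Callable[[Sequence[FakeTI]], bool]]


def condition_met(tis: Sequence[FakeTI], retry_condition: Condition = "any_failed") -> bool:
    """Decide whether the group counts as failed and should retry."""
    if callable(retry_condition):
        # Custom logic supplied by the DAG author.
        return bool(retry_condition(tis))
    if retry_condition == "any_failed":
        return any(ti.state == "failed" for ti in tis)
    if retry_condition == "all_failed":
        # An empty group is treated as not failed (assumption).
        return bool(tis) and all(ti.state == "failed" for ti in tis)
    raise ValueError(f"unknown retry_condition: {retry_condition!r}")


def apply_fast_fail(tis: Sequence[FakeTI]) -> dict[str, str]:
    """Return the state each task would end up in with retry_fast_fail=True."""
    result: dict[str, str] = {}
    for ti in tis:
        if ti.is_teardown:
            result[ti.task_id] = ti.state  # teardown is left alone
        elif ti.state == "running":
            result[ti.task_id] = "failed"  # running tasks are forced to fail
        elif ti.state in {"queued", "scheduled"}:
            result[ti.task_id] = "skipped"  # not-yet-started tasks are skipped
        else:
            result[ti.task_id] = ti.state  # finished tasks keep their state
    return result
```

   With `retry_fast_fail=False`, the equivalent of `apply_fast_fail` would 
simply not run, and the group would wait for every task to reach a terminal 
state before the next attempt.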
   
   # Design Notes
   
   I did not add support for restarting only the failing tasks (a 
`retry_strategy` option or similar). That behavior is already covered by 
`TaskInstance` retries, so partial `TaskGroup` retries would not add meaningful 
value in this initial implementation.
   
   ## UI Note
   
   No UI changes are introduced in this PR. A potential follow‑up is a 
display‑only state such as `Up for Group Retry` to mark tasks that are waiting 
on a `TaskGroup` retry delay.
   
   ## No Partial (Selective) TaskGroup Retries (for now)
   
   This implementation does not support restarting only the failing tasks 
within a `TaskGroup`.
   
   My reasoning is that selective retries are already covered by `TaskInstance` 
retries, and introducing partial group retries at this stage would add semantic 
overlap and scheduler complexity without a clearly distinct use case.
   
   That said, this is not a hard constraint of the design. If there are 
compelling scenarios where partial `TaskGroup` retries provide meaningful value 
beyond `TaskInstance` retries, we can revisit the model.

