tiagornandrade commented on issue #21867:
URL: https://github.com/apache/airflow/issues/21867#issuecomment-2704965330
Proposal: Introducing Retry Capability for TaskGroups in Airflow
Discussion Context
Originally discussed in #21333, this request highlights a limitation in
TaskGroups compared to SubDags. While TaskGroups provide a better
organizational structure, they lack a built-in retry mechanism, which was a key
feature of SubDags.
Problem Statement
TaskGroups currently cannot be retried as a unit. This presents a challenge
in scenarios where:
1. A process needs to be repeated periodically within the same DAG
run (e.g., a retry loop).
2. A partial failure should not stop dependent tasks, but a final
failure should trigger a DAG retry.
3. Multiple instances of a task group should execute in sequence
before moving to a downstream task.
Use Case / Motivation
Example Scenario:
• Task A (PythonOperator) collects data.
• Task B (PostgresOperator) updates a materialized view in
PostgreSQL.
• Task A might partially fail but still produce some data.
• Task B must run regardless of A’s outcome
(trigger_rule="all_done").
• This process should repeat every hour within the same DAG run
until a defined limit is reached.
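The scenario above can be modeled in plain Python to make the requested semantics concrete. This is only a sketch: it makes no Airflow calls, and `run_group_until_limit`, `flaky_collect`, and `record_refresh` are hypothetical names. Task B runs whatever A's outcome (mirroring trigger_rule="all_done"), and the A/B pair repeats up to a fixed limit. The hourly wait between repetitions is left out.

```python
from typing import Callable, List, Tuple


def run_group_until_limit(
    collect_data: Callable[[int], int],
    refresh_view: Callable[[int], None],
    max_runs: int,
) -> List[Tuple[int, str]]:
    """Run the A -> B pair up to max_runs times and record A's outcome per run.

    B is called whether or not A raised, mirroring trigger_rule="all_done".
    A real schedule would wait an hour between iterations; that wait is
    omitted here.
    """
    history = []
    for attempt in range(1, max_runs + 1):
        try:
            collect_data(attempt)
            outcome = "success"
        except Exception:
            outcome = "failed"  # A may still have produced partial data
        refresh_view(attempt)  # B runs regardless of A's outcome
        history.append((attempt, outcome))
    return history


# Hypothetical stand-ins for Task A and Task B.
refreshed = []


def flaky_collect(attempt: int) -> int:
    # Fail on even attempts to imitate a partial failure in Task A.
    if attempt % 2 == 0:
        raise RuntimeError("partial failure while collecting data")
    return attempt


def record_refresh(attempt: int) -> None:
    # Record that Task B ran for this attempt.
    refreshed.append(attempt)


history = run_group_until_limit(flaky_collect, record_refresh, max_runs=4)
```

In this model B is recorded for every attempt, including the ones where A failed, which is the behavior the group-level retry would need to preserve.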
Challenge:
• SubDags previously solved this: the SubDag could be retried as a unit
(e.g., retries=10), with a DummyOperator (C) marking completion.
• TaskGroups lack the retries parameter, making it impossible to
achieve this behavior within a single DAG.
• Retrying the entire DAG is not an option because the DAG is too
large.
Proposed Solution
1. Introduce a Retry Mechanism for TaskGroups
• Allow setting retries on TaskGroups similar to individual tasks.
• Implement a mechanism where all tasks in a TaskGroup can be
re-executed without affecting other parts of the DAG.
2. Alternative Workarounds Considered
• Using an external DAG trigger: creates unnecessary complexity
(requires maintaining two DAGs).
• Clearing failed tasks manually: can lead to infinite loops unless a
retry limit is enforced.
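The infinite-loop risk in the manual-clearing workaround comes from clearing without counting. Below is a minimal sketch of such a guard in plain Python. `ClearBudget` and `may_clear` are hypothetical names, and the code models only the bookkeeping; it does not call any Airflow API.

```python
from collections import defaultdict


class ClearBudget:
    """Track how many times each task group has been cleared in a DAG run."""

    def __init__(self, max_clears: int) -> None:
        self.max_clears = max_clears
        self._used = defaultdict(int)  # (run_id, group_id) -> clears so far

    def may_clear(self, run_id: str, group_id: str) -> bool:
        """Return True and consume one clear if the budget allows it."""
        key = (run_id, group_id)
        if self._used[key] >= self.max_clears:
            return False  # limit reached: stop instead of looping forever
        self._used[key] += 1
        return True


budget = ClearBudget(max_clears=2)
decisions = [budget.may_clear("run_1", "collect_and_refresh") for _ in range(4)]
```

The budget is keyed by DAG run and group, so a new DAG run starts with a fresh allowance.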
Proposed Implementation Strategy
• Introduce a retries parameter for TaskGroups.
• Implement an execution mode where TaskGroups behave like
sub-workflows, allowing controlled retries.
• Ensure that TaskGroup retries do not reset the entire DAG state.
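One way to read the last point is that a group-level retry resets only the task states inside that group. The sketch below models this with a plain dict of task states. `retry_group` and the state strings are hypothetical, and the only Airflow behavior assumed is that tasks inside a TaskGroup get IDs prefixed with the group ID by default (for example `etl.collect`).

```python
from typing import Dict


def retry_group(states: Dict[str, str], group_id: str) -> Dict[str, str]:
    """Return a copy of states with every task in group_id reset to pending.

    Tasks outside the group keep their state, so the rest of the DAG run
    is untouched.
    """
    prefix = group_id + "."
    return {
        task_id: ("pending" if task_id.startswith(prefix) else state)
        for task_id, state in states.items()
    }


states = {
    "extract": "success",
    "etl.collect": "failed",
    "etl.refresh_view": "success",
    "unrelated_task": "success",
}
after_retry = retry_group(states, "etl")
```

Returning a new dict rather than mutating the input keeps the pre-retry state available for comparison; a real implementation would also have to decide how to treat tasks downstream of the group.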
Related Issues
No directly related issues have been reported, but this proposal aligns with
the use cases discussed in #21333.
Next Steps
• Gather feedback from the community.
• Explore feasibility within the Airflow execution model.
• Implement a proof of concept for TaskGroup retries.
• Submit a PR with the proposed changes.