tiagornandrade commented on issue #21867:
URL: https://github.com/apache/airflow/issues/21867#issuecomment-2704965330

   Proposal: Introducing Retry Capability for TaskGroups in Airflow
   
   Discussion Context
   
   Originally discussed in #21333, this request highlights a limitation in 
TaskGroups compared to SubDags. While TaskGroups provide a better 
organizational structure, they lack a built-in retry mechanism, which was a key 
feature of SubDags.
   
   Problem Statement
   
   TaskGroups currently cannot be retried as a unit. This presents a challenge 
in scenarios where:
        1.      A process needs to be repeated periodically within the same DAG 
run (e.g., a retry loop).
        2.      A partial failure should not stop dependent tasks, but a final 
failure should trigger a DAG retry.
        3.      Multiple instances of a task group should execute in sequence 
before moving to a downstream task.
   
   Use Case / Motivation
   
   Example Scenario:
        •       Task A (PythonOperator) collects data.
        •       Task B (PostgresOperator) updates a materialized view in 
PostgreSQL.
        •       Task A might partially fail but still produce some data.
        •       Task B must run regardless of A’s outcome 
(trigger_rule="all_done").
        •       This process should repeat every hour within the same DAG run 
until a defined limit is reached.
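
   The control flow above can be sketched in plain Python (names like collect_data, update_view, and MAX_ITERATIONS are illustrative; a real DAG would express this with PythonOperator, PostgresOperator, and trigger_rule="all_done" rather than try/finally):

   ```python
   # Plain-Python sketch of the scenario's control flow. Illustrative only:
   # in Airflow, "B runs regardless of A" is trigger_rule="all_done", and the
   # hourly repeat would be driven by the scheduler, not a Python loop.

   MAX_ITERATIONS = 3  # stand-in for the "defined limit" on repeats


   def collect_data(iteration):
       """Task A: may fail, in which case no data reaches Task B."""
       if iteration == 1:
           raise RuntimeError("failure in Task A")
       return [f"row-{iteration}-{i}" for i in range(2)]


   def update_view(rows):
       """Task B: must run regardless of A's outcome (all_done semantics)."""
       return f"refreshed view with {len(rows)} new rows"


   def run_once(iteration):
       rows = []
       try:
           rows = collect_data(iteration)
       except RuntimeError:
           pass  # swallow A's failure so B still runs, mirroring "all_done"
       return update_view(rows)


   results = [run_once(i) for i in range(MAX_ITERATIONS)]
   ```

   Even this toy version shows the gap: the outer loop (repeat the pair until a limit) has no TaskGroup-level equivalent today.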
   
   Challenge:
        •       SubDags previously solved this by allowing retries via 
retries=10, along with a DummyOperator (C) to mark completion.
        •       TaskGroups lack the retries parameter, making it impossible to 
achieve this behavior within a single DAG.
        •       Retrying the entire DAG is not an option because the DAG is too 
large.
   
   Proposed Solution
        1.      Introduce a Retry Mechanism for TaskGroups
        •       Allow setting retries on TaskGroups, just as on individual tasks.
        •       Implement a mechanism where all tasks in a TaskGroup can be 
re-executed without affecting other parts of the DAG.
        2.      Alternative Workarounds Considered
        •       Using an external DAG trigger: creates unnecessary complexity (requires maintaining two DAGs).
        •       Clearing failed tasks manually: can lead to infinite loops unless a retry limit is enforced.
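
   The infinite-loop risk of the clearing workaround can be bounded with an explicit attempt counter. A minimal sketch, assuming an in-memory counter (in a real deployment the count would have to be persisted, e.g. in an Airflow Variable, since clearing restarts the tasks):

   ```python
   # Sketch of a bounded "clear and re-run" loop: the attempt counter guards
   # against the infinite-loop failure mode noted above. The function name and
   # max_clears parameter are illustrative, not an Airflow API.

   def run_group_with_clear_limit(tasks, max_clears=3):
       """Run `tasks` (ordered callables); on any failure, 're-clear' and
       re-run the whole group, but at most `max_clears` times."""
       for attempt in range(max_clears + 1):
           try:
               return [task() for task in tasks]
           except Exception:
               if attempt == max_clears:
                   raise  # retry budget exhausted: fail for real
               # otherwise: fall through and re-run the group


   # Usage: a group whose task fails twice, then succeeds on the third run.
   state = {"calls": 0}


   def flaky():
       state["calls"] += 1
       if state["calls"] < 3:
           raise RuntimeError("transient failure")
       return "ok"


   result = run_group_with_clear_limit([flaky], max_clears=3)
   ```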
   
   Proposed Implementation Strategy
        •       Introduce a retries parameter for TaskGroups.
        •       Implement an execution mode where TaskGroups behave like 
sub-workflows, allowing controlled retries.
        •       Ensure that TaskGroup retries do not reset the entire DAG state.
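
   One way to picture the proposed semantics is a small plain-Python model: retrying the group re-executes every task inside it, up to a retries budget, while state outside the group is untouched. The RetryableGroup class below is purely illustrative (not a proposed Airflow API); only the retries parameter name is borrowed from task-level retries:

   ```python
   # Illustrative model of the proposed semantics: a group retry re-executes
   # all tasks in the group together, and never resets state outside it.

   class RetryableGroup:
       def __init__(self, tasks, retries=0):
           self.tasks = tasks        # ordered callables inside the group
           self.retries = retries    # extra attempts, mirroring task-level retries
           self.attempts = 0

       def run(self):
           while True:
               self.attempts += 1
               try:
                   # A retry re-executes *every* task in the group,
                   # not just the one that failed.
                   return [task() for task in self.tasks]
               except Exception:
                   if self.attempts > self.retries:
                       raise


   # State outside the group: group retries must leave this alone.
   upstream_done = True

   calls = {"a": 0, "b": 0}


   def task_a():
       calls["a"] += 1
       return "a"


   def task_b():
       calls["b"] += 1
       if calls["b"] == 1:
           raise RuntimeError("first attempt fails")
       return "b"


   group = RetryableGroup([task_a, task_b], retries=2)
   outputs = group.run()
   ```

   Note that task_a runs twice even though only task_b failed: that is the SubDag-like "retry as a unit" behavior the proposal asks for.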
   
   Related Issues
   
   No direct issues reported, but this proposal aligns with use cases discussed 
in #21333.
   
   Next Steps
        •       Gather feedback from the community.
        •       Explore feasibility within the Airflow execution model.
        •       Implement a proof of concept for TaskGroup retries.
        •       Submit a PR with the proposed changes.

