dstandish commented on issue #21867:
URL: https://github.com/apache/airflow/issues/21867#issuecomment-1678858953

   Warning... this is a long one...  I just tried to catch up on the current 
state and share some thoughts and reflections based on lessons learned in AIP-52
   
   
   > We need also to consider the case of TaskGroup inside TaskGroup. Would the 
inner TaskGroup be considered a single leaf node?
   
   And 
   
   >> the definition of last can be tricky as TaskGroup can have multiple leaf 
nodes
   > good point. I'll start without the retry_condition, implementing only the 
"any failed" behavior, and this can be dealt with later on.
   
   So, a task group's leaves are well-defined, no matter how many task groups 
are nested within it: they are `{x for x in group if not 
has_downstream_task_in_group(x)}`.  And the rule for determining dag run state 
is that the run fails if any of those leaves is failed / upstream_failed.
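As a toy illustration of that leaf definition (plain Python only; the dict-based dependency map and the `group_leaves` helper are made up here, not Airflow's actual API):

```python
# Toy sketch: a group's leaves are the tasks with no downstream task
# that is also inside the same group.  (Illustrative only; this is not
# Airflow's internal representation.)
def group_leaves(group_tasks, downstreams):
    """group_tasks: set of task ids in the group (nested groups flattened);
    downstreams: map of task id -> set of downstream task ids."""
    return {t for t in group_tasks if not (downstreams.get(t, set()) & group_tasks)}

# A -> B -> C inside the group; C also arrows to X outside the group.
tasks = {"A", "B", "C"}
deps = {"A": {"B"}, "B": {"C"}, "C": {"X"}}
print(group_leaves(tasks, deps))  # -> {'C'}: X is outside the group, so C is a leaf
```

Nesting doesn't change anything here: tasks in an inner group are just members of the flattened outer set, so the definition stays the same.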
   
   So this seems like a non-issue, except for ... do you allow a _retryable_ 
group within a retryable group... oy...  I guess in principle there's no 
obvious problem.  The retry of the outer implies retry of the inner.
   
   Concerning this debate:
   
   >> I don't see this as very valuable. This option doesn't exist for DAGs, 
and I think setting individual task retries covers enough of that need.
   > I had added that requirement in response to below comment. Users could 
have any number of tasks in their task group and providing them flexibility on 
what should be retried would help. That said, yes, this could be available as 
an API/UI instead of configuration for task group model. Interested to hear 
what others think about it.
   >> What if a TaskGroup contains 20 tasks and one of them is an HTTP task 
that just retried on some transient error. Do we really want to retry the whole 
task group because of it?
   
   A few related notes before discussion of this.
   
   During AIP-52, late in the process, the idea emerged to add a flag to 
operators which would in effect mean "should I be considered for dag run 
state".  Essentially, letting any task opt out if it is a leaf.  Currently 
teardown tasks are ignored in this way by default, but it's overridable.  
We could extend this _option_ to all tasks.  And it could be generalized to 
"should I be considered for _container_ state", i.e. group or dag.
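A minimal sketch of that generalization, assuming a hypothetical per-task opt-out flag (the `opted_out` set and the state strings are stand-ins for illustration, not an existing Airflow API):

```python
# Hypothetical sketch: compute a container's (group's or dag's) state from
# its leaves, while letting individual tasks opt out of consideration,
# the way teardowns are ignored by default today.
def container_state(leaf_states, opted_out):
    """leaf_states: map of leaf task id -> terminal state string;
    opted_out: set of task ids that asked not to count toward container state."""
    considered = {t: s for t, s in leaf_states.items() if t not in opted_out}
    if any(s in ("failed", "upstream_failed") for s in considered.values()):
        return "failed"
    return "success"

# A failed teardown that opted out does not fail the container.
print(container_state({"work": "success", "teardown": "failed"}, {"teardown"}))  # -> success
```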
   
   Similarly, another flag could come into existence: "should I be counted as a 
leaf when arrowing from a task group".  Again, teardowns are by default ignored 
in this scenario.  That isn't overridable, but you can simply add the 
downstream relationship after the fact if you want it.  Again, not implemented, 
but it could be.
   
   And now coming back to the above discussion.  It might make sense to simply 
delegate to the task itself whether it should be rerun on retries of the 
container (group or dag).
   
   Certainly though, clearing should _always_ clear all the tasks in the 
container, no?
   
   > This is exactly the reason I kept retry_condition programmable. Airflow 
users should be able to define whatever condition they prefer to trigger a 
retry. Airflow has so far provided flexibility and if you do not add it here, I 
highly recommend adding an option to trigger retry on-demand with UI/API.
   
   So yeah, given the above discussion, delegating to the task seems simpler / 
more flexible, no?
   
   > D just failed and B is still running. I already know that I'll need to 
retry the task group, should I cancel B ? If I let it complete, should I also 
let C run even if I know I'll re-run it ?
   > What I'm leaning towards is that (assuming we're in "retry if any failed" 
policy) as soon as a task fails, we stop everything and restart from the 
beginning, but I wonder if there is any bad side effect to that.
   
   and 
   
   > That's one idea; another would be adding something like a parameter that 
allows the cancellation of any running task. So the user could choose whether 
he wants running tasks cancelled or to wait for their completion.
   
   You may not be aware, but a `fail_stop` feature was recently added that 
controls this behavior.  We may not need to mess with that here.
   
   > If I don't cancel B, can I still start a new run of A while it finishes 
the previous try ? And D after that ? That might be greedy, but it'd save time.
   
   That seems problematic.
   
   > Also, what status should tasks have when their task group is going to be 
retried ? Their last known status (succeeded, failed, ...) ? Or up_for_retry, 
erasing the previous status?
   
   Clear, i.e. no status, would seem to be the obvious choice, consistent with 
current behavior.  A task can't be run if it's in "success" state, so why delay 
clearing it?  Indeed, even without a delay, I can imagine a race condition 
where the dag run may _end_ before there's a chance to retry the task group; of 
course this would be more problematic with a delay.
   
   Also consider the following.  Note first that you can have tasks outside a 
group which are downstream of tasks inside the group.  So, what happens when 
tasks branch off downstream from tasks in the middle of your retryable task 
group?  In effect, by retrying your task group, you are clearing "upstream".  
But do you also "recurse downstream", i.e. do tasks that are cleared _in_ the 
group get their downstreams recursively cleared as well, so that everything 
downstream, even outside the group, is also cleared?
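The "recurse downstream" concern can be made concrete with a small sketch (illustrative only, not Airflow's actual clearing code): clearing a mid-group task and recursing downstream sweeps up everything reachable from it, with no regard for group boundaries.

```python
# Sketch: everything reachable downstream from the cleared tasks gets
# cleared too, including tasks outside the group.
def downstream_closure(start, downstreams):
    """start: set of task ids being cleared;
    downstreams: map of task id -> set of downstream task ids."""
    seen, stack = set(), list(start)
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(downstreams.get(t, ()))
    return seen

# B is inside the group; C and D branch off outside it, yet both get swept up.
print(downstream_closure({"B"}, {"B": {"C"}, "C": {"D"}}))
```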
   
   In light of this, perhaps a retryable task group should not be able to have 
any external downstreams apart from those that connect to its leaves.  Even 
then there may still be a problem: if there are multiple leaves in the group, 
maybe one is successful and one is failed, and maybe the successful leaf has 
already triggered its downstreams, which are all done now; but the other leaf 
failed, so the group retries.  Do you now clear the downstreams of the 
successful branch?  Probably not.  In view of this, perhaps what you want is 
more like SubDag, where you can't have inter-dag task deps (so no arrowing 
directly to tasks in a retryable group) and all you can do is wait for the 
black box to complete successfully.
   
   Indeed thinking this over it crossed my mind that perhaps what we want is to 
reimplement subdag as an inline construct, something different from task groups.
   
   ```python
   # Rough sketch only: `SubDag` as an inline construct does not exist today.
   with DAG(dag_id="example") as dag:
       with TaskGroup("outer_group"):
           with SubDag("inline_subdag"):
               with TaskGroup("inner_group"):
                   ...
   ```
   
   One thing I encountered when working on setup and teardown: when we first 
started on it, I think the design had TaskGroup taking on too much.  I.e., it 
wanted to make setup and teardown attributes of TaskGroup.  And this 
interfered with TaskGroup as an arbitrary container of logic, because it added 
the constraint that a setup or teardown could only happen at the beginning or 
end of a task group, and only one of each per group.  This would have required 
either a big refactor of many dags, or disuse of setup / teardown.  It also 
interfered with the use of TaskGroup as an arbitrary container of tasks, be it 
for visual grouping or business logic.  And with task groups also being an 
authoring tool for mapping, same issue there.
   
   With retrying, it actually seems substantially less problematic than what I 
describe with AIP-52, but I share the anecdote simply to note that we should be 
mindful of having one construct do too much.  Not that we are necessarily doing 
that here.  The other side of that is the risk of proliferating too many 
similar constructs.  But I think an inline SubDag has at least some potential 
worth a thought.
   

