dstandish commented on issue #21867:
URL: https://github.com/apache/airflow/issues/21867#issuecomment-1678858953
Warning... this is a long one... just tried to catch up on the current
state and share some thoughts and reflections based on learnings from AIP-52.
> We need also to consider the case of TaskGroup inside TaskGroup. Would the
inner TaskGroup be considered a single leaf node?
And
>> the definition of last can be tricky as TaskGroup can have multiple leaf
nodes
> good point. I'll start without the retry_condition, implementing only the
"any failed" behavior, and this can be dealt with later on.
So, a task group's leaves are well-defined, no matter how many task groups
are contained in it. It's `{x for x in group if not
has_downstream_task_in_group(x)}`. And the rule for determining dag run state
is: if any of those leaves is failed / upstream_failed.
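That leaf rule can be sketched with plain sets; the dict-of-sets graph structure and function names here are illustrative, not Airflow's API:

```python
def group_leaves(downstreams):
    """Return the leaves of a task group.

    `downstreams` maps each task id in the group to the set of its
    downstream task ids that are also inside the group; a leaf is a
    task with no downstream task in the group.
    """
    return {task for task, ds in downstreams.items() if not ds}


def group_failed(downstreams, states):
    """The group is failed if any leaf is failed / upstream_failed."""
    return any(
        states.get(task) in ("failed", "upstream_failed")
        for task in group_leaves(downstreams)
    )
```

This works identically for nested groups, since the inner group's tasks are just members of the outer group's graph.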
So this seems like a non-issue, except for ... do you allow a _retryable_
group within a retryable group... oy... I guess in principle there's no
obvious problem. The retry of the outer implies retry of the inner.
Concerning this debate:
>> I don't see this as very valuable. This option doesn't exist for DAGs,
and I think setting individual task retries covers enough of that need.
> I had added that requirement in response to below comment. Users could
have any number of tasks in their task group and providing them flexibility on
what should be retried would help. That said, yes, this could be available as
an API/UI instead of configuration for task group model. Interested to hear
what others think about it.
>> What if a TaskGroup consists of 20 tasks and one of them is an HTTP task
that just retried on some transient error. Do we really want to retry the
whole task group because of it?
A few related notes before discussion of this.
During AIP-52, late in the process, the idea emerged to add a flag to the
operator which would in effect mean "should I be considered for dag run state".
Essentially, letting any task "opt out" if it is a leaf. Currently,
teardown tasks are ignored in this way by default, but it's overridable.
We could extend this _option_ to all tasks. And it could be generalized to
"should I be considered for _container_ state", i.e. group or dag.
Similarly, another flag could come into existence: "should I be counted as a
leaf when arrowing from a task group". Again, teardowns are by default
ignored in this scenario. That isn't overridable, but you can simply add the
downstream relationship after the fact if you want it. Again, not
implemented, but it could be.
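To illustrate, a toy version of that arrow-from-group rule, where teardowns are skipped when picking the leaves that receive the downstream relationship (plain dicts and hypothetical names, not Airflow's API):

```python
def arrow_leaves(downstreams, is_teardown):
    """Leaves that receive the relationship when doing `group >> task`.

    `downstreams` maps task id -> downstream task ids within the group;
    teardown tasks are excluded by default, mirroring current behavior.
    """
    return {
        task for task, ds in downstreams.items()
        if not ds and not is_teardown.get(task, False)
    }
```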
And now coming back to the above discussion. It might make sense to simply
delegate to the task itself whether it should be rerun on retries of the
container (group or dag).
Certainly though, clearing should _always_ clear all the tasks in the
container, no?
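A minimal sketch of what that delegation might look like on an automatic group retry; the per-task opt-out flag (`rerun_on_container_retry` here) is hypothetical, not an existing Airflow option:

```python
def tasks_to_rerun(states, rerun_flags):
    """Pick which tasks in the group to clear on an automatic retry.

    Failed / upstream_failed tasks are always cleared; other tasks
    are cleared only if they opt in via the hypothetical per-task
    `rerun_on_container_retry` flag (defaulting to True).
    """
    to_clear = set()
    for task, state in states.items():
        if state in ("failed", "upstream_failed"):
            to_clear.add(task)
        elif rerun_flags.get(task, True):
            to_clear.add(task)
    return to_clear
```

A manual clear, by contrast, would ignore the flags and clear everything in the container.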
> This is exactly the reason I kept retry_condition programmable. Airflow
users should be able to define whatever condition they prefer to trigger a
retry. Airflow has so far provided flexibility and if you do not add it here, I
highly recommend adding an option to trigger retry on-demand with UI/API.
So yeah, I think given the above discussion, delegating to task seems
simpler / more flexible, no?
> D just failed and B is still running. I already know that I'll need to
retry the task group, should I cancel B ? If I let it complete, should I also
let C run even if I know I'll re-run it ?
> What I'm leaning towards is that (assuming we're in "retry if any failed"
policy) as soon as a task fails, we stop everything and restart from the
beginning, but I wonder if there is any bad side effect to that.
and
> That's one idea; another would be adding something like a parameter
that allows the cancellation of any running task. So the user could choose
whether he wants running tasks cancelled or to wait for their completion.
You may not be aware, but a `fail_stop` feature that controls this behavior
was recently added. We may not need to mess with that here.
> If I don't cancel B, can I still start a new run of A while it finishes
the previous try ? And D after that ? That might be greedy, but it'd save time.
that seems problematic.
> Also, what status should tasks have when their task group is going to be
retried? Their last known status (succeeded, failed, ...)? Or up_for_retry,
erasing the previous status?
Clear, i.e. no status, would seem the obvious choice, consistent with
current behavior. A task can't be run if it's in "success" state, so why
delay it? Indeed, even without a delay, I can imagine a race condition where
the dag may _end_ before there's a chance to retry the task group; of course
this would be more problematic with a delay.
Also consider the following. Note first that you can have tasks outside a
group which are downstream of tasks inside it. So what happens when tasks
branch off downstream from tasks in the middle of your retryable task group?
In effect, by retrying your task group, you are clearing "upstream". But do
you also "recurse downstream", i.e. will any tasks that are cleared _in_ the
group be recursed downstream, clearing everything below them, even outside
the group?
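To make the downstream-recursion concern concrete, a toy sketch (plain dicts, not Airflow's API) of clearing a group with recursion. Note that `x` and `y`, which sit outside the group, get swept up too:

```python
def clear_with_downstream(group_tasks, downstreams):
    """Clear the group's tasks plus everything recursively downstream.

    `downstreams` maps task id -> downstream task ids for the whole
    dag, so the sweep can escape the group boundary.
    """
    cleared = set()
    stack = list(group_tasks)
    while stack:
        task = stack.pop()
        if task in cleared:
            continue
        cleared.add(task)
        stack.extend(downstreams.get(task, ()))
    return cleared
```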
In light of this, perhaps a retryable task group should not be able to have
any external downstreams apart from those that connect to its leaves. Indeed,
there may still be a problem with that: if there are multiple leaves in the
group, maybe one is successful and one is failed, and the successful leaf has
already triggered its downstreams, which are all done now; but the other leaf
failed, so the group retries. Do you now clear the downstreams of the
successful branch? Probably not. In view of this, perhaps what you want is
more like SubDag, where you can't have inter-dag task deps (so no arrowing
directly to tasks in a retryable group) and all you can do is wait for the
black box to complete successfully.
Indeed, thinking this over, it crossed my mind that perhaps what we want is to
reimplement SubDag as an inline construct, something different from task groups:
```python
with dag:
    with TaskGroup(...):
        with SubDag(...):
            with TaskGroup(...):
                ...
```
One thing that I encountered when working on setup and teardown: when we
first started on it, I think we were having TaskGroup take on too much. I.e.
we wanted to make setup and teardown attributes of TaskGroup. This interfered
with TaskGroup as an arbitrary container of logic, because it added the
constraint that a setup or teardown could only happen at the beginning or end
of a task group, and only one of each per group. That would have required
either a big refactor of many dags, or disuse of setup / teardown. It also
interfered with the use of TaskGroup as an arbitrary container of tasks, be
it for visual grouping or business logic. And with TaskGroups also being an
authoring tool for mapping, same issue there.
With retrying, it actually seems substantially less problematic than what I
describe with AIP-52, but I share the anecdote simply to note that we should
be mindful of having one construct do too much. Not that we are necessarily
doing that here. The other side of that coin is the risk of proliferating too
many similar constructs. But I think an inline SubDag has at least some
potential worth a thought.