I've recently upgraded to 1.8.0 and immediately encountered the hanging
SubDag issue that's been mentioned. I'm not sure the rollback from rc5 to
rc4 fixed the issue.  For now I've removed all SubDags and put their
task_instances in the main DAG.

Assuming this issue gets fixed, how is one supposed to recover from
failures within SubDags after the retries have been exhausted?  Previously,
I would clear the state of the offending tasks and run a backfill job.
Backfill jobs in 1.7.1 would skip successful task_instances and only run
the task_instances with cleared states. Now, backfills and SubDagOperators
clear the state of successful tasks. I'd rather not re-run a task that
already succeeded. I tried running backfills with --task_regex and
--ignore_dependencies, but that doesn't quite work either.

If I have t1(success) -> t2(clear) -> t3(clear) and I set --task_regex so
that it excludes t1, then t2 will run, but t3 will never run because it
doesn't wait for t2 to finish. It fails because its upstream dependency
condition is not met.
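
To make the desired behavior concrete, here is a toy model in plain Python.
It is not Airflow code and makes no claim about how the scheduler or the
backfill job work internally; the names (upstream, state, select_tasks,
run_cleared) are made up for illustration. It assumes a cleared task is
represented by a state of None and that every task that runs succeeds.

```python
import re

# Toy model only: hypothetical names, not Airflow's API or internals.
upstream = {"t1": [], "t2": ["t1"], "t3": ["t2"]}
state = {"t1": "success", "t2": None, "t3": None}  # None means "cleared"


def select_tasks(task_regex):
    """Pick tasks whose id matches the regex, similar in spirit to --task_regex."""
    return [t for t in upstream if re.search(task_regex, t)]


def run_cleared(upstream, state, selected):
    """Run selected tasks in dependency order, skipping successful ones.

    Returns (tasks that ran, tasks that could never run).
    """
    state = dict(state)  # work on a copy so the caller's dict is untouched
    ran = []
    pending = [t for t in selected if state[t] != "success"]
    progressed = True
    while pending and progressed:
        progressed = False
        for t in list(pending):
            # A task may start only once all of its upstream tasks succeeded.
            if all(state[u] == "success" for u in upstream[t]):
                state[t] = "success"  # assumption: the task runs and succeeds
                ran.append(t)
                pending.remove(t)
                progressed = True
    return ran, pending


ran, stuck = run_cleared(upstream, state, select_tasks("t[23]"))
print(ran, stuck)
```

In this model t1 is never re-run, t2 runs because its only upstream task
already succeeded, and t3 runs once t2 has finished. That is the outcome I
would like from a backfill over cleared tasks.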

I like the logical grouping that SubDags provide, but I don't want to
retry all tasks even if they've already succeeded. I can see why one would want
that behavior in some cases, but it's certainly not useful in all.
