[GitHub] [airflow] kaxil commented on issue #6392: [AIRFLOW-5648] Add ClearTaskOperator for clearing tasks in a DAG

GitBox Tue, 05 Nov 2019 04:17:20 -0800

kaxil commented on issue #6392: [AIRFLOW-5648] Add ClearTaskOperator for 
clearing tasks in a DAG
URL: https://github.com/apache/airflow/pull/6392#issuecomment-549798868
 
 
   > > You could separate them into 2 DAGs, and use TriggerDagRunOperator after 
task K in DAG 1 to trigger DAG 2 once the sensor returns True.
   > > You can have a BranchPythonOperator in DAG 2 to decide whether it just needs 
to run L & M or to take the other branch, where the first task can be a 
TriggerDagRunOperator that runs DAG 1. But note that this can end up in an 
endless loop if the branch condition is always True, so make sure that your 
BranchPythonOperator in DAG 2 has the correct logic.
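
A minimal sketch of the branch decision described above, written as plain Python so the loop guard is easy to see. The task ids, the `needs_rerun` flag, the retrigger counter and its cap are all hypothetical; the thread does not say what the branch condition is or how state would be carried between DAG runs.

```python
# Hypothetical task ids for the two branches in DAG 2 (assumptions, not from the thread).
RUN_L_AND_M = "L"
RETRIGGER_DAG_1 = "trigger_dag_1"
MAX_RETRIGGERS = 1  # assumed cap on how often DAG 2 may trigger DAG 1 again


def choose_branch(needs_rerun: bool, retrigger_count: int) -> str:
    """Return the task_id that the branch in DAG 2 should follow."""
    if not needs_rerun:
        # Normal path: continue with L (M runs downstream of it).
        return RUN_L_AND_M
    if retrigger_count < MAX_RETRIGGERS:
        # Rerun path: the first task on this branch triggers DAG 1.
        return RETRIGGER_DAG_1
    # Guard against the two DAGs triggering each other forever.
    raise RuntimeError("retrigger limit reached; not triggering DAG 1 again")
```

In Airflow, a function like this would be passed as the `python_callable` of a `BranchPythonOperator`, which follows the task id that the callable returns. How the counter travels from one run to the next is left open here.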
   > 
   > Hi @kaxil, instead of `TriggerDagRunOperator`, what do you think about 
extracting the sub-DAG `A, C, E, ..., J` and putting it into a 
`SubDagOperator`, i.e. like this:
   > 
   > ```
   > A >> C >> E >> G >> H >> I >> J >> K >> L >> M >> Finish
   >                ^                   ^          
   >                |                   |         
   > B >> D >> F>>>>                    |
   >                                    |
   > Sensor >> SubDag_A_to_J >>>>>>>>>>>
   > 
   > 
   > where SubDag_A_to_J is a SubDagOperator containing these tasks:
   > A >> C >> E >> G >> H >> I >> J
   > ```
   > 
   > The user experience with `SubDagOperator` for this purpose seems great. 
Users see the rerun abstracted into a node on the main DAG, and when they click 
on the sub-DAG node it zooms into the sub-DAG, which shows what is being run. 
And most importantly for us, everything is still tightly linked together on the 
main DAG. If we need to rerun things historically, it works just fine.
   > 
   > However, some research shows that people online advise against using 
`SubDagOperator` because it has some shortcomings, e.g. [this Astronomer 
page](https://www.astronomer.io/guides/subdags/). The most cited reason is that 
`SubDagOperator` used to cause deadlocks, so its default executor in 
Airflow 1.10.* was changed to `SequentialExecutor`, which means only one task in 
the sub-DAG can run at a time. That is not great. But there are some workarounds 
offered online, such as [this 
one](https://medium.com/@team_24989/fixing-subdagoperator-deadlock-in-airflow-6c64312ebb10)
 that uses a dedicated Celery queue to fix the deadlock problem and thus can 
still let the `SubDagOperator` run tasks in parallel.
   > 
   > More interestingly, the latest `SubDagOperator` on the master branch has 
been made into a `BaseSensorOperator` in [this 
change](https://github.com/apache/airflow/pull/5498). The docstring indicates the 
deadlock issue can be fixed if we set mode to "reschedule". It sounds like the 
performance issue people were complaining about has already been addressed in 
the latest Airflow, although that change was not released in 1.10.6. Is that 
right? If so, it sounds like we have to wait a while before we can 
safely use `SubDagOperator`.
   > 
   > ```
   > Although SubDagOperator can occupy a pool/concurrency slot,
   > user can specify the mode=reschedule so that the slot will be
   > released periodically to avoid potential deadlock.
   > ```
   
   @yuqian90 Yes, if you use pools, SubDagOperator can become problematic. The 
changes to SubDagOperator may be released in Airflow 2.0.
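
To illustrate why pools and `SubDagOperator` interact badly, here is a toy model of the slot accounting that the quoted docstring describes. It is not Airflow code: the function name and the simplification that the parent operators and their child tasks share one pool are assumptions made for this sketch.

```python
def children_can_run(pool_slots: int, running_subdags: int, reschedule: bool) -> bool:
    """Toy model: can any child task of the sub-DAGs obtain a pool slot?

    Assumes the SubDagOperator tasks and their children compete for the same pool.
    """
    if reschedule:
        # mode="reschedule": the parent gives its slot back between checks,
        # so it does not hold a slot while it is waiting.
        held_by_parents = 0
    else:
        # Default behaviour: each running parent keeps one slot while it waits.
        held_by_parents = min(running_subdags, pool_slots)
    free_for_children = pool_slots - held_by_parents
    return free_for_children > 0


# Two slots, two waiting parents: every slot is taken, no child can start.
print(children_can_run(pool_slots=2, running_subdags=2, reschedule=False))
# Same pool, but the parents release their slots while waiting.
print(children_can_run(pool_slots=2, running_subdags=2, reschedule=True))
```

In this model the deadlock shows up whenever the waiting parents occupy every slot, which matches the "occupy a pool/concurrency slot" wording of the docstring quoted above.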


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
