Hi all, Just want to bump this thread up again as I have a PR (https://github.com/apache/airflow/pull/5498 <https://github.com/apache/airflow/pull/5498>) that makes SubDagOperator use normal scheduler instead of backfill scheduler to schedule tasks in subdags and would love to have some feedback from you guys.
Chao-Han On 2019/05/16 05:40:53, Chao-Han Tsai <[email protected]> wrote: > Hi all,> > > I have a WIP PR <https://github.com/apache/airflow/pull/5279/files> that> > aims to make SubDagOperator to use normal scheduler instead of the backfill> > scheduler and would love to have some feedbacks from you with regard to the> > implementation.> > > The basic idea is that we create a DagRun when executing the SubDagOperator> > and wait until the DagRun to finish. Airflow scheduler picks up the DagRun> > and executes the tasks.> > > Also I have another PR (https://github.com/apache/airflow/pull/5283) that> > consistently set the is_paused state for parent DAGs and all of its> > subdags, which I believe is required if we want to utilize normal scheduler> > to schedule tasks in SubDagOperator.> > > Thanks,> > > Chao-Han> > > > > On Mon, Apr 15, 2019 at 4:39 AM Dan Davydov <[email protected]>> > wrote:> > > > I don't think fixing subdags to run in the scheduler is enough, although> > > it's a huge improvement over the current implementation (especially the> > > part that lets Subdags specify custom executors). From my experience with> > > Subdags, I think what makes more sense is adding various operators to > > allow> > > combining regular DAGs.> > >> > > Here are a some other issues with Subdags off the top of my head:> > > - Confusing/separate UI and clearing/running semantics (e.g. tasks in the> > > Subdag will not get scheduled if you clear them but not the parent> > > operator)> > > - Nested Subdags are hard to work with in the UI (and IIRC don't behave> > > correctly but I might be wrong on this).> > > - The abstraction is confusing, e.g. looking at the log for the> > > SubdagOperator task can be a bit confusing as> > > - Tons of custom special-case logic in the Airflow code and schemas in the> > > DB to handle Subdags which have led to a lot of complexity and a constant> > > source of tricky bugs and upgrade issues> > > - Additional abstraction that users have to learn> > >> > > An alternative would be allowing combining DAGs, e.g. something like:> > > dag = DAG()> > > dag_task1 = Op(dag = dag)> > > dag_task2 = Op(dag = dag)> > >> > > subdag = DAG()> > > subdag_task1 = Op(dag = subdag)> > > subdag_task2 = Op(dag = subdag)> > >> > > dag_task1 >> subdag >> dag_task2> > > # Results in the following topology:> > > # ,-> subdag_task1 ---v> > > # dag_task1 dag_task2> > > # '-> subdag_task2----^> > >> > > This is also a lot more easily composable than subdags, and provides a > > more> > > powerful abstraction, e.g. you don't need additional boilerplate to create> > > subdags such as setting up an operator.> > >> > > On Sat, Apr 13, 2019 at 12:14 PM Felix Uellendall <[email protected]> > > >> > > wrote:> > >> > > > -1 on deprecating subdags, because of the extra level of abstraction> > > > some of you already mentioned.> > > >> > > > We also use subdags in production.> > > >> > > > For example in cases where we get json data from an API but since we> > > > mostly need it to be in csv format we have a subdag like> > > > /specific_ap//i_specific_endpoint/_to_s3 that has two tasks one for> > > > retrieving data from the API and loading it to s3 and one for> > > > transforming it into csv.> > > > This has the advantage that you don't need to think of how you transform> > > > the json to csv. In our case Data Analyst don't want to think about> > > > that. They want to work with tabular data.> > > >> > > > We also have subdags that are handling API cursoring/pagination (by> > > > using xcom) and merging these multiple API response data into one file.> > > > So you call one Task, a subdag operator with this subdag and get only> > > > what you really need - the data.> > > >> > > > I really like subdags and I am for improving / maybe redesigning or> > > > reimplementing of subdags.> > > >> > > > -feluelle> > > >> > > > Am 13/04/2019 um 07:52 schrieb Chao-Han Tsai:> > > > > +1 on keeping it.> > > > >> > > > > I think we should keep the SubDags as it provides a good abstraction> > > > layer.> > > > > It just need some love from us to fix the underlying> > > > > performance/reliability issues.> > > > >> > > > > On Fri, Apr 12, 2019 at 12:06 PM Ash Berlin-Taylor <[email protected]>> > > > wrote:> > > > >> > > > >> This is what I was thinking - the dag collector in the scheduler> > > should> > > > >> "just" be able to collect the tasks for subdags up to the parent dag.> > > > I'd> > > > >> possibly go as far as saying no DagRun object for subdags too.> > > > >>> > > > >> (Yes, "just" will never be that simple).> > > > >>> > > > >> -a> > > > >>> > > > >> On 12 April 2019 18:37:24 BST, Bolke de Bruin <[email protected]>> > > > wrote:> > > > >>> +1> > > > >>>> > > > >>> Sub dags should be fixed within the scheduler and run normally.> > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > >>> On 12 April 2019 at 19:36:27, Feng Lu ([email protected])> > > > >>> wrote:> > > > >>>> > > > >>> Agree with others who think SubDag should stay, we should fix the> > > > >>> SubDag> > > > >>> implementation but not remove the abstraction itself.> > > > >>>> > > > >>> On Fri, Apr 12, 2019 at 8:42 AM Chen Tong <[email protected]> wrote:> > > > >>>> > > > >>>> Is it possible to re-implement it in the view-level, not in > > > >>>> operator> > > > >>> level?> > > > >>>> And this operator is just define a different view in GUI, that > > > >>>> these> > > > >>> tasks> > > > >>>> will be collapsed into another view.> > > > >>>>> > > > >>>> On Fri, Apr 12, 2019 at 11:31 AM James Meickle> > > > >>>> <[email protected]> wrote:> > > > >>>>> > > > >>>>> I have avoided using them because of outstanding issues like the> > > > >>> open> > > > >>>> JIRA> > > > >>>>> issues I linked above, or similar issues that I've read about on> > > > >>> blog> > > > >>>>> posts. If it were just GUI or UX issues I'd use them, but many> > > > >>> people> > > > >>>> have> > > > >>>>> reported issues which affect concurrency/stability, consistency, > > > >>>>> or> > > > >>>>> correctness of results. I believe that it's working for you, but> > > > >>> for> > > > >>> me,> > > > >>>>> it's not worth the risk to build using them in my environment > > > >>>>> (even> > > > >>>> though> > > > >>>>> they could be handy for many of our workflows).> > > > >>>>>> > > > >>>>> On Fri, Apr 12, 2019 at 11:18 AM Kaxil Naik <[email protected]>> > > > >>> wrote:> > > > >>>>>> I have been using SubDags in production and haven't had much> > > > >>> problem> > > > >>>> with> > > > >>>>>> it.> > > > >>>>>>> > > > >>>>>> Can you list the issues you had?> > > > >>>>>>> > > > >>>>>> Regards,> > > > >>>>>> Kaxil> > > > >>>>>>> > > > >>>>>>> > > > >>>>>> On Fri, Apr 12, 2019, 16:16 James Meickle> > > > >>> <[email protected]> > > > >>>>>> .invalid>> > > > >>>>>> wrote:> > > > >>>>>>> > > > >>>>>>> Given their bad reputation, would it be appropriate to> > > > >>> deprecate> > > > >>>>> subDAGs> > > > >>>>>>> now to advertise that they're no longer considered a suitable> > > > >>>>>>> implementation? If a new and better implementation is created,> > > > >>> would> > > > >>>> it> > > > >>>>>>> even be similar enough to subDAGs that we'd want to continue to> > > > >>> call> > > > >>>>> the> > > > >>>>>>> feature that?> > > > >>>>>>>> > > > >>>>>>> They feel like a "new Airflow user trap" right now - I have had> > > > >>> to> > > > >>>> tell> > > > >>>>>> my> > > > >>>>>>> team never to use them, because they seem appealing and are in> > > > >>> the> > > > >>>>>> official> > > > >>>>>>> docs.> > > > >>>>>>>> > > > >>>>>>> On Fri, Apr 12, 2019 at 10:51 AM Ash Berlin-Taylor> > > > >>> <[email protected]>> > > > >>>>>> wrote:> > > > >>>>>>>> I'd like to find time to fix subdags as they do provide a> > > > >>> useful> > > > >>>>>>>> abstraction - but I agree right now they aren't great (I> > > > >>> avoid> > > > >>> them> > > > >>>>>>> because> > > > >>>>>>>> of this)> > > > >>>>>>>>> > > > >>>>>>>> I have half thoughts of how to it should work, I just need to> > > > >>> look> > > > >>>> at> > > > >>>>>> the> > > > >>>>>>>> code in depth to see if that makes sense. Now 1.10.3 is out I> > > > >>> might> > > > >>>>>> have> > > > >>>>>>> a> > > > >>>>>>>> bit more time to do this.> > > > >>>>>>>>> > > > >>>>>>>> -ash> > > > >>>>>>>>> > > > >>>>>>>>> On 12 Apr 2019, at 15:48, James Meickle> > > > >>> <[email protected]> > > > >>>>>>> .INVALID>> > > > >>>>>>>> wrote:> > > > >>>>>>>>> I think we should deprecate SubDAGs given the complexity> > > > >>> they> > > > >>> add> > > > >>>>> and> > > > >>>>>>> the> > > > >>>>>>>>> limited usage and use cases. Or, we should invest effort in> > > > >>>>>> redesigning> > > > >>>>>>>>> their API and implementation. I think that having to> > > > >>> account> > > > >>> for> > > > >>>>>>>>> subdag-introduced complexity makes Airflow's code much> > > > >>> harder> > > > >>> to> > > > >>>>>>> maintain> > > > >>>>>>>>> and buggier, looking at how many open issues there are that> > > > >>>>> reference> > > > >>>>>>>>> subdags (and how unrelated in topic they are):> > > > >>>>>>>>>> > > > >>> > > >> > > https://issues.apache.org/jira/browse/AIRFLOW-3292?jql=project%20%3D%20AIRFLOW%20AND%20status%20%3D%20Open%20AND%20text%20~%20%22subdag%22> > > > > > >>>>>>>>> > > > >> > > >> > >> > > > -- > > > Chao-Han Tsai> >
