Thanks Laura, this helps! I was hoping you would reply :) Very good points about UI / logs / restarts - I think at this point I really like option #2 myself.
I still wonder if people do something creative to generate complex DAGs outside of the DAG folder - this would be a case where it takes significant time to poll metadata/databases to generate all the tasks. I do not know if it is possible, as I am not strong with Python (actually, I have been learning Python as I am learning Airflow!). The idea is to have an external .py script generate static .py files for the DAG(s) and place these generated files under the Airflow dag_folder once a day or on some schedule. Is anyone doing this, or am I over-complicating things and #2 should just work? In my case it might take a good minute to parse metadata files and some database tables to actually generate the DAG tasks. I also imagine it will produce a heck of a lot of log records, since the scheduler polls the dag folder every minute and this process will repeat itself a minute later - so it will run non-stop unless I change the Airflow scheduler settings.

On Fri, Oct 21, 2016 at 11:39 AM, Laura Lorenz <llor...@industrydive.com> wrote:

> We've been evolving from the type 1 you describe to a pull/poll version of
> the type 2 you describe. For type 1, it is really hard to tell what's
> going on (all the UI views become useless because they are so huge).
> Having one big DAG also means you can't turn off the scheduler for
> individual parts, and the whole DAG fails if one task does, so if you can
> functionally separate them I think that gives you more configuration
> options. Our biggest DAG now is more like 22*10 tasks, which is still too
> big in our opinion. We leverage ExternalTaskSensors to link DAGs together,
> which is more of a pull/poll paradigm, but you could use a
> TriggerDagRunOperator if you wanted more of a push/trigger paradigm, which
> is what I hear you saying in type 2.
>
> To your second question, our DAGs are dynamic based on the results of an
> API call we embed in the DAG, and our scheduler is on a 5-second interval
> for each attempt to refill the DagBag.
> I think because the scheduler polls the files frequently, our API call is
> relatively fast, we are working with DAGs that are on a 24-hour
> schedule_interval, and the resultant DAG structure is not too large or
> complicated, we haven't had any issues with that or done anything special.
> I think it's just the fact of the matter that if you give the scheduler a
> lot of work to do to determine the DAG shape, it will take a while.
>
> Laura
>
> On Fri, Oct 21, 2016 at 10:52 AM, Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
> > Guys, would you mind chiming in and sharing your experience?
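For what it's worth, the "external script writes static DAG files" idea I described could look something like the sketch below. It is only an illustration under assumptions of mine: the names `fetch_task_names`, `DAG_FOLDER`, and the `run_*.sh` commands are made up, and `fetch_task_names` stands in for the slow metadata/database poll. The generator would run outside Airflow (e.g. from cron once a day), so the scheduler only ever parses a plain, fast-loading .py file:

```python
# Hypothetical generator script, run OUTSIDE Airflow (e.g. daily via cron).
# It does the expensive metadata poll once, then writes a static DAG file
# into the dag_folder, so the scheduler's frequent refresh stays cheap.
import os
import textwrap

DAG_FOLDER = "/tmp/airflow_dags"  # assumption: point this at your real dag_folder


def fetch_task_names():
    """Stand-in for the slow metadata/database poll (illustrative data)."""
    return ["load_customers", "load_orders", "load_invoices"]


# Template for the generated file; {tasks} is filled with one line per task.
TEMPLATE = textwrap.dedent('''\
    # AUTO-GENERATED -- do not edit by hand
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG("generated_etl", start_date=datetime(2016, 10, 1),
              schedule_interval="@daily")
    {tasks}
''')

TASK_LINE = ('t_{name} = BashOperator(task_id="{name}", '
             'bash_command="run_{name}.sh", dag=dag)')


def write_dag_file():
    """Render the template and drop the static DAG file into the dag_folder."""
    tasks = "\n".join(TASK_LINE.format(name=n) for n in fetch_task_names())
    os.makedirs(DAG_FOLDER, exist_ok=True)
    path = os.path.join(DAG_FOLDER, "generated_etl.py")
    with open(path, "w") as f:
        f.write(TEMPLATE.format(tasks=tasks))
    return path


# Usage: write_dag_file() returns the path of the generated DAG file.
```

The trade-off is that the DAG shape only changes when the generator runs, which matches the once-a-day refresh I had in mind.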