Re: Best practices for dynamically generated tasks and dags

Laura Lorenz Fri, 21 Oct 2016 08:40:45 -0700

We've been evolving from type 1 you describe to a pull/poll version of the
type 2 you describe. For type 1, it is really hard to tell what's going on
(all the UI views become useless because they are so huge). Having one big
DAG also means you can't turn off the scheduler for individual parts, and
the whole DAG fails if one task does, so if you can functionally separate
them I think that gives you more configuration options. Our biggest DAG now
is more like 22*10 tasks, which is still too big in our opinion. We
leverage ExternalTaskSensors to link DAGs together, which is more of a
pull/poll paradigm, but you could use a TriggerDagRunOperator if you wanted
more of a push/trigger paradigm, which is what I hear you saying in type 2.
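
To make the pull/poll versus push/trigger distinction concrete, here is a
toy sketch in plain Python. It does not import Airflow and is not the code
from our deployment; ToyDag, wait_for and the dag ids are hypothetical names
made up for illustration. The polling function only mimics what a sensor
does conceptually, and the callback list only mimics what a trigger does.

```python
# Toy model only: no Airflow imports, every name here is hypothetical.
class ToyDag:
    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.state = "pending"
        self.on_success = []  # callbacks used by the push/trigger style

    def run(self):
        self.state = "success"
        # Push/trigger: the upstream side starts whatever depends on it.
        for callback in self.on_success:
            callback()


def wait_for(upstream, max_pokes, between_pokes=lambda: None):
    """Pull/poll: the downstream side checks the upstream state repeatedly.

    Returns the poke number on which success was seen, or None on timeout.
    `between_pokes` stands in for whatever happens while the poller waits.
    """
    for poke in range(1, max_pokes + 1):
        if upstream.state == "success":
            return poke
        between_pokes()
    return None


# Pull/poll: upstream finishes between the first and second check.
extract = ToyDag("extract")
pokes_needed = wait_for(extract, max_pokes=3, between_pokes=extract.run)

# Push/trigger: upstream kicks off downstream as soon as it succeeds.
extract_push = ToyDag("extract")
load = ToyDag("load")
extract_push.on_success.append(load.run)
extract_push.run()
```

In the poll style the downstream side owns the dependency and pays for the
repeated checks; in the push style the upstream side has to know what to
start.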

To your second question, our DAGs are dynamic based on the results of an
API call we embed in the DAG, and our scheduler is on a 5-second interval
for each attempt to refill the DagBag. We haven't had any issues with that
or done anything special, and I think that is for a few reasons: the
scheduler polls the files frequently, our API call is relatively fast, our
DAGs are on a 24-hour schedule_interval, and the resultant DAG structure is
not too large or complicated. I think it's simply the case that if you give
the scheduler a lot of work to do to determine the DAG shape, it will take
a while.
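
One way to check how much work the DAG shape costs is to time the shape
computation on its own, outside the scheduler. The sketch below is plain
Python with no Airflow imports and is only an illustration: fetch_sources
is a hypothetical stand-in for the API call, build_shape and timed_build
are made-up helper names, and the extract/load naming is an assumption. The
5-second budget mirrors the refill interval mentioned above.

```python
import time


def fetch_sources():
    """Hypothetical stand-in for the API call embedded in the DAG file."""
    return ["alpha", "beta", "gamma"]


def build_shape(sources):
    """Map each task id to the list of task ids it depends on.

    Assumes, for illustration, one extract -> load chain per source.
    """
    shape = {}
    for name in sources:
        extract = "extract_" + name
        load = "load_" + name
        shape[extract] = []          # no upstream dependency
        shape[load] = [extract]      # load waits on its own extract
    return shape


def timed_build(fetch, budget_seconds):
    """Build the shape and report whether it fit in the time budget."""
    start = time.perf_counter()
    shape = build_shape(fetch())
    elapsed = time.perf_counter() - start
    return shape, elapsed, elapsed < budget_seconds


shape, elapsed, within_budget = timed_build(fetch_sources, budget_seconds=5.0)
```

If within_budget comes back False for the real API call, the shape is
taking longer to compute than the gap between refill attempts, which is
the situation described above where the scheduler has too much work to do.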

Laura

On Fri, Oct 21, 2016 at 10:52 AM, Boris Tyukin <bo...@boristyukin.com>
wrote:

> Guys, would you mind to chime in and share your experience?
>
