We also use a similar approach to generate dynamic DAGs based on a common
template DAG file.  We pull in the list of config objects, one per DAG,
from an internal API that lightly wraps the database, then we cache that
response in an Airflow Variable that gets updated once a minute.  The
dynamic DAGs are generated from that Variable.
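
In code, the pattern looks roughly like this (the Variable name, config
fields, and callable below are illustrative, not our exact code):

    import json
    from datetime import datetime

    from airflow import DAG
    from airflow.models import Variable
    from airflow.operators.python_operator import PythonOperator

    # The Variable is refreshed out-of-band, roughly once a minute,
    # with the list of per-DAG config objects from the internal API.
    configs = json.loads(Variable.get("dag_configs", default_var="[]"))

    for config in configs:
        dag_id = "dynamic_{}".format(config["name"])

        dag = DAG(
            dag_id=dag_id,
            start_date=datetime(2018, 1, 1),
            schedule_interval=config.get("schedule", "@daily"),
        )

        def process(config=config, **kwargs):
            # do the real work for this config here
            pass

        PythonOperator(
            task_id="process",
            python_callable=process,
            provide_context=True,
            dag=dag,
        )

        # Expose the DAG at module level so the scheduler picks it up.
        globals()[dag_id] = dag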

*Taylor Edmiston*
TEdmiston.com <https://www.tedmiston.com/> | Blog
<http://blog.tedmiston.com>
Stack Overflow CV <https://stackoverflow.com/story/taylor> | LinkedIn
<https://www.linkedin.com/in/tedmiston/> | AngelList
<https://angel.co/taylor>


On Wed, Mar 21, 2018 at 1:54 PM, Nicolas Kijak <
[email protected]> wrote:

> Kyle,
>
> We have a similar approach but on a much, much smaller scale. We now have
> <100 “things to process” but expect it to grow to ~200.  Each “thing to
> process” has the same workflow, so we have a single DAG definition that
> does about 20 tasks per item; then we loop over the list of items and
> produce a DAG object for each one, adding it to the module's global
> namespace.
>
> One of the things we quickly ran into was crushing the scheduler, as
> everything was running with the same start time.  To get around this we
> add noise to the start time's minute and seconds: simply index % 60.  This
> spreads out the load so that the scheduler isn’t trying to run everything
> at the exact same moment.  If you do go this route, I would suggest also
> staggering your hours if you can, given how many you plan to run.  Perhaps
> your DAGs are smaller and aren’t as CPU intensive as ours.
>
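
For reference, the per-item pattern Nicolas describes looks roughly like
this (item names, schedule, and task are placeholders; the index % 60
offset lands on the start time's minute and second):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    ITEMS = ["item_a", "item_b", "item_c"]  # really ~100-200 "things to process"

    for index, item in enumerate(ITEMS):
        # Offset the start time's minute and second by index % 60 so the
        # scheduler isn't firing every DAG at the exact same instant.
        offset = index % 60
        dag = DAG(
            dag_id="process_{}".format(item),
            start_date=datetime(2018, 1, 1, 0, offset, offset),
            schedule_interval=timedelta(hours=1),
        )

        # The real workflow has ~20 tasks per item; one placeholder here.
        DummyOperator(task_id="do_work", dag=dag)

        # Add the DAG object to the module's globals so Airflow discovers it.
        globals()[dag.dag_id] = dag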
> On 3/21/18, 1:35 PM, "Kyle Hamlin" <[email protected]> wrote:
>
>     Hello,
>
>     I'm currently using Airflow for some ETL tasks where I submit a Spark
>     job to a cluster and poll until it is complete. This workflow is nice
>     because it is typically a single DAG. I'm now starting to do more
>     machine learning tasks and need to build a model per client, for 1000+
>     clients. My Spark cluster is capable of handling this workload;
>     however, it doesn't seem scalable to write 1000+ DAGs to fit models for
>     each client. I want each client to have its own task instance so it can
>     be retried if it fails without having to run all 1000+ tasks over
>     again. How do I handle this type of workflow in Airflow?
>
>
>
