We also use a similar approach to generate dynamic DAGs from a common template DAG file. We pull in the list of config objects, one per DAG, from an internal API that lightly wraps the database, then cache that response in an Airflow Variable that gets refreshed once a minute. The dynamic DAGs are generated from that Variable.
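Roughly, the template file looks like the sketch below. The Variable name ("dag_configs"), the config fields, and the single task are simplified placeholders; the real config objects and task graph are more involved.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

# One config object per DAG, cached in an Airflow Variable that an external
# process refreshes from the internal API about once a minute.
configs = Variable.get("dag_configs", default_var=[], deserialize_json=True)

def process(config, **kwargs):
    # Placeholder for the real work each dynamic DAG performs.
    print("processing", config["name"])

for config in configs:
    dag_id = "dynamic_{}".format(config["name"])

    dag = DAG(
        dag_id=dag_id,
        start_date=datetime(2018, 1, 1),
        schedule_interval=config.get("schedule", "@daily"),
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    )

    PythonOperator(
        task_id="process",
        python_callable=process,
        op_kwargs={"config": config},
        dag=dag,
    )

    # Registering the DAG object at module level is what makes the
    # scheduler pick it up.
    globals()[dag_id] = dag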
*Taylor Edmiston*
TEdmiston.com <https://www.tedmiston.com/> | Blog <http://blog.tedmiston.com>
Stack Overflow CV <https://stackoverflow.com/story/taylor> | LinkedIn
<https://www.linkedin.com/in/tedmiston/> | AngelList <https://angel.co/taylor>

On Wed, Mar 21, 2018 at 1:54 PM, Nicolas Kijak <[email protected]> wrote:

> Kyle,
>
> We have a similar approach but on a much, much smaller scale. We now have
> <100 “things to process” but expect it to grow to under ~200. Each “thing
> to process” has the same workflow, so we have a single DAG definition that
> does about 20 tasks per item, then we loop over the list of items and
> produce a DAG object for each one, adding it to the global definition.
>
> One of the things we quickly ran into was crushing the scheduler because
> everything was running with the same start time. To get around this we add
> noise to the start time's minute and seconds: simply index % 60. This
> spreads out the load so that the scheduler isn't trying to run everything
> at the exact same moment. If you do go this route, I would suggest also
> staggering your hours if you can, because of how many you plan to run.
> Perhaps your DAGs are smaller and aren't as CPU-intensive as ours.
>
> On 3/21/18, 1:35 PM, "Kyle Hamlin" <[email protected]> wrote:
>
>     Hello,
>
>     I'm currently using Airflow for some ETL tasks where I submit a Spark
>     job to a cluster and poll till it is complete. This workflow is nice
>     because it is typically a single DAG. I'm now starting to do more
>     machine learning tasks and need to build a model per client, which is
>     1000+ clients. My Spark cluster is capable of handling this workload;
>     however, it doesn't seem scalable to write 1000+ DAGs to fit models
>     for each client. I want each client to have its own task instance so
>     it can be retried if it fails without having to run all 1000+ tasks
>     over again. How do I handle this type of workflow in Airflow?
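For reference, a rough sketch of the start-time staggering Nicolas describes. The item list, dag_id pattern, and daily interval below are placeholder assumptions, not his actual code.

from datetime import datetime, timedelta

from airflow import DAG

# Hypothetical list of "things to process".
items = ["client_{}".format(i) for i in range(200)]

for index, item in enumerate(items):
    dag = DAG(
        dag_id="process_{}".format(item),
        # Offset the minute and second by index % 60 so the scheduler isn't
        # launching every DAG at the exact same moment; staggering the hour
        # as well spreads the load further when there are many DAGs.
        start_date=datetime(2018, 1, 1, index % 24, index % 60, index % 60),
        schedule_interval=timedelta(days=1),
    )
    globals()[dag.dag_id] = dag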
