Kyle,

We have a similar approach, but on a much, much smaller scale. We currently have fewer than 100 “things to process” and expect that to grow to roughly 200. Each “thing to process” follows the same workflow, so we have a single DAG definition of about 20 tasks per item; we loop over the list of items, produce a DAG object for each one, and add it to the module's globals.
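In case a concrete outline helps, the pattern is roughly the sketch below. ITEMS, build_dag, and the task callables are placeholders rather than our actual code:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # Illustrative stand-ins; the real list has ~100 entries and the
    # real callables do the actual work.
    ITEMS = ['item_a', 'item_b', 'item_c']

    def build_dag(item, start_date=datetime(2018, 1, 1)):
        """Build one DAG per item, all sharing the same ~20-task workflow."""
        default_args = {
            'owner': 'airflow',
            'start_date': start_date,
            'retries': 2,
            'retry_delay': timedelta(minutes=5),
        }
        dag = DAG(dag_id='process_%s' % item,
                  default_args=default_args,
                  schedule_interval=timedelta(days=1))
        with dag:
            # Stand-ins for the ~20 real tasks; each item gets its own
            # task instances, so a failure only retries that item's work.
            extract = PythonOperator(task_id='extract',
                                     python_callable=lambda: None)
            process = PythonOperator(task_id='process',
                                     python_callable=lambda: None)
            extract >> process
        return dag

    # Registering each DAG object in the module's globals is what makes
    # the scheduler pick it up as a separate DAG.
    for item in ITEMS:
        globals()['process_%s' % item] = build_dag(item)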
One of the things we quickly ran into was crushing the scheduler, because every generated DAG had the same start time. To get around this we add noise to the start time's minute and seconds, simply index % 60. This spreads out the load so the scheduler isn't trying to launch everything at the exact same moment. If you do go this route, given how many DAGs you plan to run, I would suggest staggering the hours as well if you can. Perhaps your DAGs are smaller and aren't as CPU intensive as ours.
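Concretely, the registration loop from the sketch above becomes something like this (again just an illustration):

    # With a timedelta schedule, runs happen at start_date + N * interval,
    # so offsetting each item's start minute/second spreads the load.
    # With far more DAGs, vary the hour as well.
    for index, item in enumerate(ITEMS):
        start = datetime(2018, 1, 1, 2, index % 60, index % 60)
        globals()['process_%s' % item] = build_dag(item, start_date=start)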
On 3/21/18, 1:35 PM, "Kyle Hamlin" <[email protected]> wrote:

Hello,

I'm currently using Airflow for some ETL tasks where I submit a Spark job to a cluster and poll until it is complete. This workflow is nice because it is typically a single DAG. I'm now starting to do more machine learning tasks and need to build a model per client, and there are 1000+ clients. My Spark cluster is capable of handling this workload; however, it doesn't seem scalable to write 1000+ DAGs to fit models for each client. I want each client to have its own task instance so it can be retried if it fails without having to run all 1000+ tasks over again. How do I handle this type of workflow in Airflow?