Kyle,

We have a similar approach, but on a much, much smaller scale. We currently have fewer than 100 "things to process" and expect that to grow to around 200. Each "thing to process" has the same workflow, so we have a single DAG definition of about 20 tasks per item; we loop over the list of items, produce a DAG object for each one, and add it to the module's global namespace so the scheduler picks each one up.
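In code it looks roughly like this (ITEMS, build_dag, and the single task shown are illustrative, not our actual definitions):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# in practice this list would come from config or a lookup, not be hard-coded
ITEMS = ["client_a", "client_b", "client_c"]

def build_dag(item):
    dag = DAG(
        dag_id="process_{}".format(item),
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )
    # ~20 tasks per item in our case; only one shown here
    BashOperator(
        task_id="run_{}".format(item),
        bash_command="echo processing {}".format(item),
        dag=dag,
    )
    return dag

for item in ITEMS:
    # registering the DAG object in the module's globals is what lets
    # the scheduler discover each generated DAG
    globals()["process_{}".format(item)] = build_dag(item)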

One of the things we quickly ran into was crushing the scheduler, because every DAG was running with the same start time. To get around this we add noise to the start time's minute and seconds, simply index % 60. This spreads out the load so the scheduler isn't trying to kick everything off at the exact same moment. If you do go this route, I'd suggest also staggering the hours if you can, given how many DAGs you plan to run. Then again, perhaps your DAGs are smaller and aren't as CPU intensive as ours.
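The staggering itself is just something along these lines (a sketch, reusing the illustrative build_dag loop above):

from datetime import datetime

def staggered_start(index):
    # use index % 60 as the minute and second so each generated DAG
    # gets a slightly different start time instead of all firing at once
    offset = index % 60
    return datetime(2018, 1, 1, 0, offset, offset)

# each DAG then gets its own start_date, e.g.
# for i, item in enumerate(ITEMS):
#     dag = build_dag(item)
#     dag.start_date = staggered_start(i)
#     globals()["process_{}".format(item)] = dag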

On 3/21/18, 1:35 PM, "Kyle Hamlin" <[email protected]> wrote:

    Hello,
    
    I'm currently using Airflow for some ETL tasks where I submit a Spark job
    to a cluster and poll until it is complete. This workflow is nice because it
    is typically a single DAG. I'm now starting to do more machine learning
    tasks and need to build a model per client, for 1000+ clients. My
    Spark cluster is capable of handling this workload; however, it doesn't
    seem scalable to write 1000+ DAGs to fit models for each client. I want
    each client to have its own task instance so it can be retried if it
    fails without having to run all 1000+ tasks over again. How do I handle
    this type of workflow in Airflow?
    
