Hi,
This is a neat use case - glad you’re using Airflow for it!
Out of curiosity, why don’t you create a single DAG - call it 
company_models_dag - and register each task to that DAG? So for each company 
you’d have one Spark task that builds its model. You can generate these tasks 
programmatically by looping over your list of companies. That way, if one 
company’s model fails you can just rerun that task rather than the entire DAG 
- you pick up where it left off, i.e. the failed task and its downstream 
tasks. One thing to consider: you’ll want to limit the DAG’s concurrency 
(unless your Spark cluster really can run 1000 tasks at once). 
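Roughly, the generated tasks might look something like this - just a sketch
using the contrib SparkSubmitOperator; the company list, application path,
connection id, and concurrency cap are placeholders you’d swap for your own:

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

    # In practice this would be your 1000+ company list (e.g. loaded from a file or DB).
    companies = ["acme", "globex", "initech"]

    dag = DAG(
        dag_id="company_models_dag",
        start_date=datetime(2018, 3, 1),
        schedule_interval="@daily",
        concurrency=50,   # cap how many model-building tasks run at once
        catchup=False,
    )

    for company in companies:
        # One task per company; if one fails, you clear/retry just that task.
        SparkSubmitOperator(
            task_id="build_model_{}".format(company),
            application="/path/to/build_model.py",   # hypothetical Spark job
            application_args=["--company", company],
            conn_id="spark_default",
            retries=2,
            dag=dag,
        )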

Sent from my iPhone

> On Mar 21, 2018, at 10:34 AM, Kyle Hamlin <[email protected]> wrote:
> 
> Hello,
> 
> I'm currently using Airflow for some ETL tasks where I submit a spark job
> to a cluster and poll till it is complete. This workflow is nice because it
> is typically a single Dag. I'm now starting to do more machine learning
> tasks and need to build a model per client which is 1000+ clients. My
> spark cluster is capable of handling this workload, however, it doesn't
> seem scalable to write 1000+ dags to fit models for each client. I want
> each client to have its own task instance so it can be retried if it
> fails without having to run all 1000+ tasks over again. How do I handle
> this type of workflow in Airflow?
