Hi Ashish,

Partitioned tasks could be modeled as n-many triggered, parameterized DAGs/subdags, where the parameter is the partition key. I've used this pattern a lot in other systems, though not with Airflow specifically, so I can't say exactly how you'd implement it here, but hopefully it gives you some ideas.
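To make that a bit more concrete, here's a minimal sketch of one common variant: generating one task per partition key in a loop when the DAG file is parsed, so each partition trains independently. Everything here is a hypothetical placeholder (the PARTITION_KEYS list, the dag_id, and the process_partition callable); you'd substitute your real keys and model-training logic. With a distributed executor (e.g. Celery), the generated tasks can run in parallel on different workers.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # Hypothetical partition keys; in practice you might load these from
    # a config file or a metadata table.
    PARTITION_KEYS = ["us", "eu", "apac"]

    def process_partition(partition_key, **context):
        # Placeholder for the real model-training logic for one partition.
        print("Training model for partition %s" % partition_key)

    dag = DAG(
        dag_id="partitioned_training",  # hypothetical name
        start_date=datetime(2017, 8, 1),
        schedule_interval="@daily",
    )

    # Generate one task per partition key. The scheduler treats these as
    # independent tasks, so they can run in parallel across workers.
    for key in PARTITION_KEYS:
        PythonOperator(
            task_id="train_%s" % key,
            python_callable=process_partition,
            op_kwargs={"partition_key": key},
            provide_context=True,
            dag=dag,
        )

One caveat with this approach: the set of keys has to be known at DAG-parse time. If the keys are only known at runtime, that's where the triggered-DAG variant (e.g. TriggerDagRunOperator firing a parameterized DAG once per key) comes in.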
Brian

> On Aug 9, 2017, at 10:23 AM, Ashish Rawat <[email protected]> wrote:
>
> Yes, I believe they are used for splitting a bigger DAG into smaller DAGs,
> for clarity and reusability. In our use case, we need to split/replicate a
> specific task into multiple tasks, based on the different values of a key,
> essentially data partitioning and processing.
>
> --
> Regards,
> Ashish
>
>> On 09-Aug-2017, at 10:49 PM, Van Klaveren, Brian N. <[email protected]>
>> wrote:
>>
>> Have you looked into subdags?
>>
>> Brian
>>
>>> On Aug 9, 2017, at 10:16 AM, Ashish Rawat <[email protected]> wrote:
>>>
>>> Thanks George. Our use case also requires periodic scheduling (daily), as
>>> well as task dependencies, so we chose Airflow for this use case. However,
>>> some of the tasks in a DAG have now become too big to execute on one node,
>>> and we want to split them into multiple tasks to reduce execution time.
>>> Would you recommend firing parts of an Airflow DAG in another framework?
>>>
>>> --
>>> Regards,
>>> Ashish
>>>
>>>> On 09-Aug-2017, at 10:40 PM, George Leslie-Waksman
>>>> <[email protected]> wrote:
>>>>
>>>> Airflow is best for situations where you want to run different tasks that
>>>> depend on each other or process data that arrives over time. If your goal
>>>> is to take a large dataset, split it up, and process chunks of it, there
>>>> are probably other tools better suited to your purpose.
>>>>
>>>> Off the top of my head, you might consider Dask:
>>>> https://dask.pydata.org/en/latest/ or directly using Celery:
>>>> http://www.celeryproject.org/
>>>>
>>>> --George
>>>>
>>>> On Wed, Aug 9, 2017 at 9:52 AM Ashish Rawat <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi - Can anyone please provide some pointers for this use case over
>>>>> Airflow?
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Ashish
>>>>>
>>>>>> On 03-Aug-2017, at 9:13 PM, Ashish Rawat <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We have a use case where we are running some R/Python based data science
>>>>>> models, which execute on a single node. The execution time of the models
>>>>>> is constantly increasing, and we are now planning to split the model
>>>>>> training by a partition key and distribute the workload over multiple
>>>>>> machines.
>>>>>>
>>>>>> Does Airflow provide some simple way to split a task into multiple
>>>>>> tasks, all of which will work on a specific value of the key?
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Ashish
