Yes. Our batch-processing back end for the Resource Manager can now take
advantage of the Mesos cluster manager. This enables OODT to farm batch
processing out to a Mesos cluster.

-Michael
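For anyone who wants to try this, the wiring would presumably live in the
Resource Manager's resource.properties. The sketch below is only an
illustration: the Mesos-related factory class and property names are
assumptions, so check the OODT documentation for the real ones.

    # resource.properties -- illustrative sketch only
    # Swap the default batch manager factory for a Mesos-backed one
    # (the class name below is an assumed placeholder, not verified)
    resource.batchmgr.factory=org.apache.oodt.cas.resource.batchmgr.MesosBatchManagerFactory
    # Mesos master that batch jobs should be farmed out to
    # (this property name is likewise an assumption)
    resource.mesos.master=mesos-master.example.com:5050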
On Thu, Mar 12, 2015 at 7:58 AM, BW <[email protected]> wrote:

> Any thoughts on integrating a plug-in service with Marathon first, then
> layering Mesos on top?
>
> On Wednesday, March 11, 2015, Mattmann, Chris A (3980)
> <[email protected]> wrote:
>
> > Apache OODT now has a workflow plugin that connects to Mesos:
> >
> > http://oodt.apache.org/
> >
> > Cross-posting this to [email protected] so people like
> > Mike Starch can chime in.
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: [email protected]
> > WWW: http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> > -----Original Message-----
> > From: Zameer Manji <[email protected]>
> > Reply-To: "[email protected]" <[email protected]>
> > Date: Wednesday, March 11, 2015 at 3:21 PM
> > To: "[email protected]" <[email protected]>
> > Subject: Re: Data processing pipeline workflow management
> >
> > > Hey,
> > >
> > > This is a great question. See my comments inline below.
> > >
> > > On Tue, Mar 10, 2015 at 8:28 AM, Lars Albertsson
> > > <[email protected]> wrote:
> > >
> > > > We are evaluating Aurora as a workflow management tool for batch
> > > > processing pipelines. We basically need a tool that regularly runs
> > > > batch processes that are connected as producers/consumers of data,
> > > > typically stored in HDFS or S3.
> > > >
> > > > The alternative tools would be Azkaban, Luigi, and Oozie, but I am
> > > > hoping that something built on Aurora would result in a better
> > > > solution.
> > > >
> > > > Does anyone have experience with building workflows with Aurora?
> > > > How is Twitter handling batch pipelines? Would the approach below
> > > > make sense, or are there better suggestions? Is there anything
> > > > related to this on the roadmap, or available inside Twitter only?
> > >
> > > As far as I know, you are the first person to consider Aurora for
> > > workflow management for batch processing. Currently Twitter does not
> > > use Aurora for batch pipelines.
> > > I'm not aware of the specifics of the design, but at Twitter there
> > > is an internal solution for pipelines built upon Hadoop/YARN.
> > > Currently Aurora is designed around being a service scheduler, and
> > > I'm not aware of any future plans to support workflows or batch
> > > computation.
> > >
> > > > In our case, the batch processes will be a mix of cluster
> > > > computations with Spark and single-node computations. We want the
> > > > latter to also be scheduled on a farm, and this is why we are
> > > > attracted to Mesos. In the text below, I'll call each part of a
> > > > pipeline a 'step', in order to avoid confusion with Aurora jobs
> > > > and tasks.
> > > >
> > > > My unordered wishlist is:
> > > >
> > > > * Data pipelines consist of DAGs, where steps take one or more
> > > > inputs and generate one or more outputs.
> > > >
> > > > * Independent steps in the DAG execute in parallel, constrained
> > > > by resources.
> > > >
> > > > * Steps can be written in different languages and frameworks,
> > > > some clustered.
> > > >
> > > > * The developer code/test/debug cycle is quick, and all
> > > > functional tests can execute on a laptop.
> > > >
> > > > * Developers can test integrated data pipelines, consisting of
> > > > multiple steps, on laptops.
> > > >
> > > > * Steps and their inputs and outputs are parameterised, e.g. by
> > > > date. A parameterised step is typically independent of other
> > > > instances of the same step, e.g. joining one day's impressions
> > > > log with user demographics. In some cases, steps depend on
> > > > yesterday's results, e.g. applying one day's user management
> > > > operation log to the user dataset from the day before.
> > > >
> > > > * Data pipelines are specified in embedded DSL files (e.g. aurora
> > > > files), kept close to the business logic code.
> > > >
> > > > * Batch steps should be started soon after their input files
> > > > become available.
> > > >
> > > > * Steps should gracefully avoid recomputation when output files
> > > > exist.
> > > >
> > > > * Backfilling a window back in time, e.g. 30 days, should happen
> > > > automatically if some earlier steps have failed, or if output
> > > > files have been deleted manually.
> > > >
> > > > * Continuous deployment, in the sense that steps are
> > > > automatically deployed and scheduled after 'git push'.
> > > >
> > > > * Step owners can get an overview of step status and history, and
> > > > can debug step execution, e.g. by accessing log files.
> > > >
> > > > I am aware that no framework will give us everything. It is a
> > > > matter of how much we need to live without or build ourselves.
> > >
> > > Your wishlist looks pretty reasonable for batch computation
> > > workflows.
> > >
> > > I'm not aware of any batch/workflow Mesos framework. If you want
> > > some or all of the above features on top of Mesos, I think you
> > > would be venturing into writing your own framework.
> > > Aurora doesn't have the concept of a DAG, and it can't make
> > > scheduling decisions based on job progress or HDFS state.
> > >
> > > --
> > > Zameer Manji
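To make Lars's wishlist concrete, here is roughly what one
date-parameterised step could look like as an Aurora job. The
Process/Task/Job DSL is Aurora's own, but everything else here (the
{{date}} binding, the paths, and the join_impressions.py script) is a
hypothetical illustration, not a tested pipeline.

    # pipeline.aurora -- hypothetical sketch of one parameterised batch step
    join_step = Process(
      name = 'join_impressions',
      # Skip recomputation when the output for this date already exists
      cmdline = 'hdfs dfs -test -e /data/join_impressions/{{date}} || '
                './join_impressions.py --date {{date}} '
                '--impressions /data/impressions/{{date}} '
                '--output /data/join_impressions/{{date}}'
    )

    jobs = [Job(
      cluster = 'example', role = 'pipelines', environment = 'prod',
      name = 'join_impressions',
      task = Task(
        name = 'join_impressions',
        processes = [join_step],
        resources = Resources(cpu = 2, ram = 4*GB, disk = 8*GB)
      )
    )]

Each run would then bind the parameter with the Aurora client, along the
lines of: aurora job create example/pipelines/prod/join_impressions
pipeline.aurora --bind date=2015-03-10 (the job key and paths are made up
for the example).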
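And since, as Zameer notes, Aurora has no concept of a DAG and can't make
scheduling decisions based on HDFS state, the ordering and backfill logic
would have to live in a driver you write yourself. Below is a toy Python
sketch of that layer, assuming the hypothetical pipeline.aurora above;
every step name, path, and command in it is an illustrative assumption.

    # driver.py -- toy sketch of the external DAG/backfill layer
    import subprocess
    from datetime import date, timedelta

    # Steps in topological order, each with its upstream dependencies.
    STEPS = [
        ('join_impressions', []),
        ('train_model', ['join_impressions']),
    ]

    def output_exists(step, day):
        # "Steps should gracefully avoid recomputation when output files
        # exist" -- probe HDFS for this step's output.
        path = '/data/%s/%s' % (step, day.isoformat())
        return subprocess.call(['hdfs', 'dfs', '-test', '-e', path]) == 0

    def run_step(step, day):
        # Hand execution to Aurora. A real driver would also wait for
        # the job to finish, e.g. by polling until the output appears.
        subprocess.check_call([
            'aurora', 'job', 'create',
            'example/pipelines/prod/' + step, 'pipeline.aurora',
            '--bind', 'date=' + day.isoformat()])

    def backfill(days_back=30):
        # "Backfilling a window back in time, e.g. 30 days": rerun any
        # step whose output is missing but whose inputs are present.
        for offset in range(days_back, 0, -1):
            day = date.today() - timedelta(days=offset)
            for step, deps in STEPS:
                if output_exists(step, day):
                    continue
                if all(output_exists(dep, day) for dep in deps):
                    run_step(step, day)

    if __name__ == '__main__':
        backfill()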
