Yes. Our batch-processing back end for the Resource Manager can now take
advantage of the Mesos cluster manager. This enables OODT to farm batch
processing out to a Mesos cluster.

-Michael
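For anyone who wants to try this, the wiring would presumably live in the
Resource Manager's resource.properties. The sketch below is only an
illustration: the Mesos-related factory class and property names are
assumptions, so check the OODT documentation for the real ones.

    # resource.properties -- illustrative sketch only
    # Swap the default batch manager factory for a Mesos-backed one
    # (the class name below is an assumed placeholder, not verified)
    resource.batchmgr.factory=org.apache.oodt.cas.resource.batchmgr.MesosBatchManagerFactory
    # Mesos master that batch jobs should be farmed out to
    # (this property name is likewise an assumption)
    resource.mesos.master=mesos-master.example.com:5050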
On Thu, Mar 12, 2015 at 7:58 AM, BW <[email protected]> wrote:

> Any thoughts on integrating a plug-in service with Marathon first, then
> layering Mesos on top?
>
> On Wednesday, March 11, 2015, Mattmann, Chris A (3980)
> <[email protected]> wrote:
>
> > Apache OODT now has a workflow plugin that connects to Mesos:
> >
> > http://oodt.apache.org/
> >
> > Cross-posting this to [email protected] so people like
> > Mike Starch can chime in.
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: [email protected]
> > WWW: http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> > -----Original Message-----
> > From: Zameer Manji <[email protected]>
> > Reply-To: "[email protected]" <[email protected]>
> > Date: Wednesday, March 11, 2015 at 3:21 PM
> > To: "[email protected]" <[email protected]>
> > Subject: Re: Data processing pipeline workflow management
> >
> > > Hey,
> > >
> > > This is a great question. See my comments inline below.
> > >
> > > On Tue, Mar 10, 2015 at 8:28 AM, Lars Albertsson
> > > <[email protected]> wrote:
> > >
> > > > We are evaluating Aurora as a workflow management tool for batch
> > > > processing pipelines. We basically need a tool that regularly runs
> > > > batch processes that are connected as producers/consumers of data,
> > > > typically stored in HDFS or S3.
> > > >
> > > > The alternative tools would be Azkaban, Luigi, and Oozie, but I am
> > > > hoping that something built on Aurora would result in a better
> > > > solution.
> > > >
> > > > Does anyone have experience with building workflows with Aurora?
> > > > How is Twitter handling batch pipelines? Would the approach below
> > > > make sense, or are there better suggestions? Is there anything
> > > > related to this on the roadmap, or available inside Twitter only?
> > >
> > > As far as I know, you are the first person to consider Aurora for
> > > workflow management for batch processing. Currently Twitter does not
> > > use Aurora for batch pipelines.
> > > I'm not aware of the specifics of the design, but at Twitter there
> > > is an internal solution for pipelines built upon Hadoop/YARN.
> > > Currently Aurora is designed around being a service scheduler, and
> > > I'm not aware of any future plans to support workflows or batch
> > > computation.
> > >
> > > > In our case, the batch processes will be a mix of cluster
> > > > computations with Spark and single-node computations. We want the
> > > > latter to also be scheduled on a farm, and this is why we are
> > > > attracted to Mesos. In the text below, I'll call each part of a
> > > > pipeline a 'step', in order to avoid confusion with Aurora jobs
> > > > and tasks.
> > > >
> > > > My unordered wishlist is:
> > > >
> > > > * Data pipelines consist of DAGs, where steps take one or more
> > > > inputs and generate one or more outputs.
> > > >
> > > > * Independent steps in the DAG execute in parallel, constrained
> > > > by resources.
> > > >
> > > > * Steps can be written in different languages and frameworks,
> > > > some clustered.
> > > >
> > > > * The developer code/test/debug cycle is quick, and all
> > > > functional tests can execute on a laptop.
> > > >
> > > > * Developers can test integrated data pipelines, consisting of
> > > > multiple steps, on laptops.
> > > >
> > > > * Steps and their inputs and outputs are parameterised, e.g. by
> > > > date. A parameterised step is typically independent of other
> > > > instances of the same step, e.g. joining one day's impressions
> > > > log with user demographics. In some cases, steps depend on
> > > > yesterday's results, e.g. applying one day's user management
> > > > operation log to the user dataset from the day before.
> > > >
> > > > * Data pipelines are specified in embedded DSL files (e.g. aurora
> > > > files), kept close to the business logic code.
> > > >
> > > > * Batch steps should be started soon after their input files
> > > > become available.
> > > >
> > > > * Steps should gracefully avoid recomputation when output files
> > > > exist.
> > > >
> > > > * Backfilling a window back in time, e.g. 30 days, should happen
> > > > automatically if some earlier steps have failed, or if output
> > > > files have been deleted manually.
> > > >
> > > > * Continuous deployment, in the sense that steps are
> > > > automatically deployed and scheduled after 'git push'.
> > > >
> > > > * Step owners can get an overview of step status and history, and
> > > > can debug step execution, e.g. by accessing log files.
> > > >
> > > > I am aware that no framework will give us everything. It is a
> > > > matter of how much we need to live without or build ourselves.
> > >
> > > Your wishlist looks pretty reasonable for batch computation
> > > workflows.
> > >
> > > I'm not aware of any batch/workflow Mesos framework. If you want
> > > some or all of the above features on top of Mesos, I think you
> > > would be venturing into writing your own framework.
> > > Aurora doesn't have the concept of a DAG, and it can't make
> > > scheduling decisions based on job progress or HDFS state.
> > >
> > > --
> > > Zameer Manji
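To make Lars's wishlist concrete, here is roughly what one
date-parameterised step could look like as an Aurora job. The
Process/Task/Job DSL is Aurora's own, but everything else here (the
{{date}} binding, the paths, and the join_impressions.py script) is a
hypothetical illustration, not a tested pipeline.

    # pipeline.aurora -- hypothetical sketch of one parameterised batch step
    join_step = Process(
      name = 'join_impressions',
      # Skip recomputation when the output for this date already exists
      cmdline = 'hdfs dfs -test -e /data/join_impressions/{{date}} || '
                './join_impressions.py --date {{date}} '
                '--impressions /data/impressions/{{date}} '
                '--output /data/join_impressions/{{date}}'
    )

    jobs = [Job(
      cluster = 'example', role = 'pipelines', environment = 'prod',
      name = 'join_impressions',
      task = Task(
        name = 'join_impressions',
        processes = [join_step],
        resources = Resources(cpu = 2, ram = 4*GB, disk = 8*GB)
      )
    )]

Each run would then bind the parameter with the Aurora client, along the
lines of: aurora job create example/pipelines/prod/join_impressions
pipeline.aurora --bind date=2015-03-10 (the job key and paths are made up
for the example).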
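And since, as Zameer notes, Aurora has no concept of a DAG and can't make
scheduling decisions based on HDFS state, the ordering and backfill logic
would have to live in a driver you write yourself. Below is a toy Python
sketch of that layer, assuming the hypothetical pipeline.aurora above;
every step name, path, and command in it is an illustrative assumption.

    # driver.py -- toy sketch of the external DAG/backfill layer
    import subprocess
    from datetime import date, timedelta

    # Steps in topological order, each with its upstream dependencies.
    STEPS = [
        ('join_impressions', []),
        ('train_model', ['join_impressions']),
    ]

    def output_exists(step, day):
        # "Steps should gracefully avoid recomputation when output files
        # exist" -- probe HDFS for this step's output.
        path = '/data/%s/%s' % (step, day.isoformat())
        return subprocess.call(['hdfs', 'dfs', '-test', '-e', path]) == 0

    def run_step(step, day):
        # Hand execution to Aurora. A real driver would also wait for
        # the job to finish, e.g. by polling until the output appears.
        subprocess.check_call([
            'aurora', 'job', 'create',
            'example/pipelines/prod/' + step, 'pipeline.aurora',
            '--bind', 'date=' + day.isoformat()])

    def backfill(days_back=30):
        # "Backfilling a window back in time, e.g. 30 days": rerun any
        # step whose output is missing but whose inputs are present.
        for offset in range(days_back, 0, -1):
            day = date.today() - timedelta(days=offset)
            for step, deps in STEPS:
                if output_exists(step, day):
                    continue
                if all(output_exists(dep, day) for dep in deps):
                    run_step(step, day)

    if __name__ == '__main__':
        backfill()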
