Thanks Sebastian. The scope includes allowing for a complex DAG within the same 'job' and, as such, it generalizes MapReduce to look more like Stratosphere/Hyracks. The goal is to better support Hive/Pig/Cascading/Crunch etc.
Hope that helps.

thanks,
Arun

On Feb 19, 2013, at 1:23 AM, Sebastian Schelter wrote:

> Hi,
>
> This proposal looks very interesting to me. What exactly is the scope of
> Tez? Does it aim to be a general data flow system such as
> Stratosphere[1] or Hyracks[2]? Or will it still be executing Map and
> Reduce tasks that are composable in a more flexible manner?
>
> Best,
> Sebastian
>
> [1] http://dl.acm.org/citation.cfm?id=1807148
> https://www.stratosphere.eu/sites/default/files/papers/NephelePACTs_10.pdf
>
> [2] http://dl.acm.org/citation.cfm?id=2005632
> http://asterix.ics.uci.edu/pub/Hyracks.pdf
>
> On 19.02.2013 09:53, Avik Dey wrote:
>> The Tez incubator proposal seems to have a lot in common with the work on
>> https://issues.apache.org/jira/browse/OOZIE-1178
>>
>>> It is useful to have a workflow application master, which will be capable
>>> of running a DAG of jobs. The workflow client submits a DAG request to the
>>> AM and then the AM will manage the life cycle of this application in terms
>>> of requesting the needed resources from the RM, and starting, monitoring
>>> and retrying the application's individual tasks.
>>>
>>> Compared to running Oozie with the current MapReduce Application Master,
>>> these are some of the advantages:
>>>
>>> - Fewer consumed resources, since only one application master
>>>   will be spawned for the whole workflow.
>>> - Reuse of resources, since the same resources can be used by multiple
>>>   consecutive jobs in the workflow (no need to request/wait for
>>>   resources for every individual job from the central RM).
>>> - More optimization opportunities in terms of collective resource
>>>   requests.
>>> - Optimization opportunities in terms of rewriting and composing jobs
>>>   in the workflow (e.g. pushing down Mappers).
>>> - This Application Master can be reused/extended by higher-level systems
>>>   like Pig and Hive to provide an optimized way of running their
>>>   workflows.
>>>
>>> So, is this the 'yapp' proposal that was discussed on that thread?
>>
>> ~avik
>>
>> On Mon, Feb 18, 2013 at 9:40 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>
>>> This seems like a reasonable project (basically it is the long-fabled
>>> map-reduce-reduce or MCR* in Google terminology).
>>>
>>> But it is *very* heavy with Hortonworks developers. By my count, the
>>> proportion is over half from HW with only token representation from other
>>> companies:
>>>
>>> 13 Hortonworks
>>> 4 Yahoo
>>> 3 Facebook
>>> 2 Microsoft
>>> 1 Cloudera
>>>
>>> Shouldn't this be a bit broader to start with? Or is that an incubation
>>> task?
>>>
>>> On Mon, Feb 18, 2013 at 9:29 PM, Arun C Murthy <a...@hortonworks.com>
>>> wrote:
>>>
>>>> Folks,
>>>>
>>>> I'd like to propose adding Tez to the Apache Incubator:
>>>> http://wiki.apache.org/incubator/TezProposal
>>>>
>>>> Essentially, it's the next step to improve projects in the Apache Hadoop
>>>> ecosystem such as Apache Hive, Apache Pig, Cascading (ASL2, but not an
>>>> ASF project) by providing a more complex DAG of 'tasks' in a single
>>>> application to process data, thereby providing significant advantages
>>>> for them.
>>>>
>>>> During the time I've spent working on MapReduce, I've forever heard
>>>> complaints from Pig/Hive folks about the fact that MapReduce provides a
>>>> very constrained task graph which results in an excessive number of
>>>> MapReduce jobs... *smile*. It's very exciting to take this next step,
>>>> and I would be thrilled to have it happen in the ASF - as you can see in
>>>> the proposal this effort has broad support from members of MapReduce,
>>>> Hive & Pig communities, many of whom are eager to participate and have
>>>> already contributed their efforts during the initial prototype.
>>>>
>>>> I welcome your feedback/discussion and look forward to it!
>>>>
>>>> thanks,
>>>> Arun
>>>> (proposed Champion)
>>>>
>>>> ----
>>>>
>>>> = Tez =
>>>>
>>>> == Abstract ==
>>>> Tez is an effort to develop a generic application framework which can be
>>>> used to process arbitrarily complex data-processing tasks and also a
>>>> re-usable set of data-processing primitives which can be used by other
>>>> projects.
>>>>
>>>> == Proposal ==
>>>> Tez is a proposal to develop a generic application which can be used to
>>>> process complex data-processing task DAGs and runs natively on Apache
>>>> Hadoop YARN. YARN is a generic resource-management system on which
>>>> applications like MapReduce already exist. MapReduce is a specific, and
>>>> constrained, DAG - which is not optimal for several frameworks like
>>>> Apache Hive and Apache Pig. Furthermore, we propose to develop a
>>>> re-usable set of libraries of data-processing primitives such as
>>>> sorting, merging, data-shuffling, intermediate data management etc.
>>>> which are necessary for Tez and which we envision can be used directly
>>>> by other projects.
>>>>
>>>> == Background ==
>>>> Apache Hadoop MapReduce has emerged as the assembly-language on which
>>>> other frameworks like Apache Pig and Apache Hive have been built.
>>>> However, it has been well accepted that MapReduce produces very
>>>> constrained task DAGs for each job, which results in Apache Pig and
>>>> Apache Hive requiring multiple MapReduce jobs for several queries. By
>>>> providing a more expressive DAG of tasks for a job, Tez attempts to
>>>> provide significantly enhanced data-processing capabilities for projects
>>>> like Apache Pig, Apache Hive, Cascading etc.
>>>>
>>>> == Rationale ==
>>>> Tez fills an important gap in the Apache Hadoop ecosystem by allowing
>>>> for more expressive task DAGs for data-processing applications such as
>>>> Apache Pig, Apache Hive, Cascading etc.
>>>>
>>>> With the emergence of Apache Hadoop YARN, there is a strong need for a
>>>> common DAG application which can then be shared by Apache Pig, Apache
>>>> Hive, Cascading etc.
>>>>
>>>> == Initial Goals ==
>>>> The initial goals for this project are to specify the detailed
>>>> requirements and architecture, and then develop the initial
>>>> implementation including the DAG ApplicationMaster to run natively
>>>> inside Apache Hadoop YARN.
>>>>
>>>> == Current Status ==
>>>> Significant work has been completed to identify the initial requirements
>>>> and define the overall system architecture. There is a patch available
>>>> in the internal Hortonworks git repository which can act as the initial
>>>> seed.
>>>>
>>>> === Meritocracy ===
>>>> We plan to invest in supporting a meritocracy. We will discuss the
>>>> requirements in an open forum. Several companies have already expressed
>>>> interest in this project, and we intend to invite additional developers
>>>> to participate. We will encourage and monitor community participation so
>>>> that privileges can be extended to those that contribute.
>>>>
>>>> === Community ===
>>>> The need for a generic DAG application for data processing in open
>>>> source is tremendous, so there is a potential for a very large
>>>> community. We believe that Tez's extensible architecture will further
>>>> encourage community participation. Also, related Apache projects (e.g.,
>>>> Pig, Hive) have very large and active communities, and we expect that
>>>> over time Tez will also attract a large community.
>>>>
>>>> === Core Developers ===
>>>> The developers on the initial committers list include people very
>>>> experienced in the Apache Hadoop ecosystem:
>>>>
>>>> * Alan Gates <gates at apache dot org>
>>>> * Arun C Murthy <acmurthy at apache dot org>
>>>> * Ashutosh Chauhan <hashutosh at apache dot org>
>>>> * Bikas Saha <bikas at apache dot org>
>>>> * Chris Douglas <cdouglas at apache dot org>
>>>> * Daryn Sharp <daryn at apache dot org>
>>>> * Devaraj Das <ddas at apache dot org>
>>>> * Gopal Vijayaraghavan <gopal at hortonworks dot com>
>>>> * Gunther Hagleitner <ghagleitner at hortonworks dot com>
>>>> * Hitesh Shah <hitesh at apache dot org>
>>>> * Jason Lowe <jlowe at apache dot org>
>>>> * Jean Xu <jeanxu at facebook dot com>
>>>> * Jitendra Pandey <jitendra at apache dot org>
>>>> * Kevin Wilfong <kevinwilfong at apache dot org>
>>>> * Mike Liddell <mike dot lidell at microsoft dot com>
>>>> * Namit Jain <namit at apache dot org>
>>>> * Owen O'Malley <omalley at apache dot org>
>>>> * Robert Evans <bobby at apache dot org>
>>>> * Siddharth Seth <sseth at apache dot org>
>>>> * Tom White <tomwhite at apache dot org>
>>>> * Thomas Graves <tgraves at apache dot org>
>>>> * Vikram Dixit <vikram at apache dot org>
>>>> * Vinod Kumar Vavilapalli <vinodkv at apache dot org>
>>>>
>>>> We realize that though we have significant employer diversity already,
>>>> additional diversity is always better, and we will work
>>>> aggressively to recruit developers from additional companies.
>>>>
>>>> === Alignment ===
>>>> The initial committers strongly believe that a standard task DAG
>>>> application on Apache Hadoop YARN will gain broader adoption as an open
>>>> source, community driven project, where the community can contribute not
>>>> only to the core components, but also to a growing collection of
>>>> applications which will be based on top of Tez.
>>>> Our hope is that the Apache Hive, Apache Pig, Cascading and other
>>>> communities will find tremendous value in Tez and will adopt it en
>>>> masse.
>>>>
>>>> == Known Risks ==
>>>>
>>>> === Orphaned Products ===
>>>> The contributors are leading users and vendors in the Apache Hadoop
>>>> ecosystem, with significant open source experience, so the risk of being
>>>> orphaned is relatively low. The project could be at risk if vendors
>>>> decided to change their strategies in the market. In such an event, the
>>>> current committers plan to continue working on the project on their own
>>>> time, though the progress will likely be slower. We plan to mitigate
>>>> this risk by recruiting additional committers.
>>>>
>>>> === Inexperience with Open Source ===
>>>> The initial committers include veteran Apache members (Committers, PMC
>>>> members and Apache Members) and other developers who have varying
>>>> degrees of experience with open source projects. All have been involved
>>>> with source code that has been released under an open source license,
>>>> and several also have experience developing code with an open source
>>>> development process.
>>>>
>>>> === Homogenous Developers ===
>>>> The initial committers are employed by a number of companies, including
>>>> Cloudera, Facebook, Hortonworks, Microsoft and Yahoo. We are committed to
>>>> recruiting additional committers from other companies based on their
>>>> contributions to the project even though we do have significant diversity
>>>> already.
>>>>
>>>> === Reliance on Salaried Developers ===
>>>> It is expected that Tez development will occur on both salaried time and
>>>> on volunteer time, after hours. The majority of initial committers are
>>>> paid by their employer to contribute to this project.
>>>> However, they are all passionate about the project, and we are
>>>> confident that the project will continue even if no salaried developers
>>>> contribute to the project. We are committed to recruiting additional
>>>> committers including non-salaried developers.
>>>>
>>>> === Relationships with Other Apache Products ===
>>>> As mentioned in the Alignment section, Tez is closely integrated with
>>>> Hadoop, Hive and Pig in numerous ways. We look forward to collaborating
>>>> with those communities, as well as other Apache communities.
>>>>
>>>> === An Excessive Fascination with the Apache Brand ===
>>>> Tez solves a real need for generic task DAG management in the Apache
>>>> Hadoop ecosystem, something which has been addressed in a very ad hoc
>>>> manner so far by multiple Apache projects. Our rationale for developing
>>>> Tez as an Apache project is detailed in the Rationale section. We
>>>> believe that the Apache brand and community process will help us attract
>>>> more contributors to this project, and help establish ubiquitous APIs.
>>>>
>>>> == Documentation ==
>>>> http://wiki.apache.org/incubator/TezProposal
>>>>
>>>> == Initial Source ==
>>>> Available as a patch.
>>>>
>>>> == Cryptography ==
>>>> Tez will eventually support encryption on the wire. This is not one of
>>>> the initial goals, and we do not expect Tez to be a controlled export
>>>> item due to the use of encryption.
>>>>
>>>> == Required Resources ==
>>>>
>>>> === Mailing List ===
>>>> * tez-private
>>>> * tez-dev
>>>> * tez-user
>>>>
>>>> === Subversion Directory ===
>>>> Git is the preferred source control system: git://git.apache.org/tez
>>>>
>>>> === Issue Tracking ===
>>>>
>>>> JIRA Tez (TEZ)
>>>>
>>>> == Initial Committers ==
>>>> * Alan Gates <gates at apache dot org>
>>>> * Arun C Murthy <acmurthy at apache dot org>
>>>> * Ashutosh Chauhan <hashutosh at apache dot org>
>>>> * Bikas Saha <bikas at apache dot org>
>>>> * Chris Douglas <cdouglas at apache dot org>
>>>> * Daryn Sharp <daryn at apache dot org>
>>>> * Devaraj Das <ddas at apache dot org>
>>>> * Gopal Vijayaraghavan <gopal at hortonworks dot com>
>>>> * Gunther Hagleitner <ghagleitner at hortonworks dot com>
>>>> * Hitesh Shah <hitesh at apache dot org>
>>>> * Jason Lowe <jlowe at apache dot org>
>>>> * Jean Xu <jeanxu at facebook dot com>
>>>> * Jitendra Pandey <jitendra at apache dot org>
>>>> * Kevin Wilfong <kevinwilfong at apache dot org>
>>>> * Mike Liddell <mike dot lidell at microsoft dot com>
>>>> * Namit Jain <namit at apache dot org>
>>>> * Owen O'Malley <omalley at apache dot org>
>>>> * Robert Evans <bobby at apache dot org>
>>>> * Siddharth Seth <sseth at apache dot org>
>>>> * Tom White <tomwhite at apache dot org>
>>>> * Thomas Graves <tgraves at apache dot org>
>>>> * Vikram Dixit <vikram at apache dot org>
>>>> * Vinod Kumar Vavilapalli <vinodkv at apache dot org>
>>>>
>>>> == Affiliations ==
>>>> The initial committers are employees of Cloudera, Facebook, Hortonworks,
>>>> Microsoft and Yahoo Inc.
>>>>
>>>> * Alan Gates - Hortonworks
>>>> * Arun C Murthy - Hortonworks
>>>> * Ashutosh Chauhan - Hortonworks
>>>> * Bikas Saha - Hortonworks
>>>> * Chris Douglas - Microsoft
>>>> * Daryn Sharp - Yahoo
>>>> * Devaraj Das - Hortonworks
>>>> * Gopal Vijayaraghavan - Hortonworks
>>>> * Gunther Hagleitner - Hortonworks
>>>> * Hitesh Shah - Hortonworks
>>>> * Jason Lowe - Yahoo
>>>> * Jean Xu - Facebook
>>>> * Jitendra Pandey - Hortonworks
>>>> * Kevin Wilfong - Facebook
>>>> * Mike Liddell - Microsoft
>>>> * Namit Jain - Facebook
>>>> * Owen O'Malley - Hortonworks
>>>> * Robert Evans - Yahoo
>>>> * Siddharth Seth - Hortonworks
>>>> * Tom White - Cloudera
>>>> * Thomas Graves - Yahoo
>>>> * Vikram Dixit - Hortonworks
>>>> * Vinod Kumar Vavilapalli - Hortonworks
>>>>
>>>> The nominated mentors are employees of Hortonworks,
>>>> NASA JPL and Microsoft.
>>>>
>>>> * Alan Gates - Hortonworks
>>>> * Arun C Murthy - Hortonworks
>>>> * Chris Douglas - Microsoft
>>>> * Chris Mattman - NASA JPL
>>>> * Owen O'Malley - Hortonworks
>>>>
>>>> == Sponsors ==
>>>>
>>>> === Champion ===
>>>> Arun C Murthy <acmurthy at apache dot org>
>>>>
>>>> === Nominated Mentors ===
>>>> * Alan Gates <gates at apache dot org> - Architect at Hortonworks.
>>>>   Committer for Pig.
>>>> * Arun C Murthy <acmurthy at apache dot org> - Architect at
>>>>   Hortonworks. Committer for Hadoop.
>>>> * Chris Douglas <cdouglas at apache dot org> - Sr. Research Engineer at
>>>>   Microsoft. Committer for Hadoop.
>>>> * Chris Mattman <mattmann at apache dot org> - Sr. Computer Scientist,
>>>>   NASA JPL. Committer for Nutch, OODT and Tika.
>>>> * Owen O'Malley <omalley at apache dot org> - Architect at
>>>>   Hortonworks. Committer for Hadoop, Ambari.
>>>>
>>>> === Sponsoring Entity ===
>>>> Incubator
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>>>> For additional commands, e-mail: general-h...@incubator.apache.org

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/