Thanks, Frances! That explains it. I wrote a couple of posts on basic usage of Crunch; maybe it's time to rewrite them with Dataflow.

On Fri, Jan 22, 2016 at 10:58 AM, Frances Perry <f...@google.com.invalid> wrote:
> Crunch started as a clone of FlumeJava, which was Google internal. In the
> meantime inside Google, FlumeJava evolved into Dataflow. So all three share
> a number of concepts like PCollections, ParDo, DoFn, etc. However, Dataflow
> adds a number of new things -- the biggest being a unified batch/streaming
> semantics using concepts like Windowing and Triggers. Tyler Akidau's
> O'Reilly post has a really nice explanation:
> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
>
> On Fri, Jan 22, 2016 at 10:42 AM, Ashish <paliwalash...@gmail.com> wrote:
>
>> Crunch has Spark pipelines, but not sure about the runner abstraction.
>>
>> Maybe Josh Wills or Tom White can provide more insight on this topic.
>> They are core devs for both projects :)
>>
>> On Fri, Jan 22, 2016 at 9:47 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>> > Hi,
>> >
>> > I don't know Crunch deeply, but AFAIK, Crunch creates MapReduce
>> > pipelines; it doesn't provide a runner abstraction. It's based on
>> > FlumeJava.
>> >
>> > The logic is very similar (with DoFns, pipelines, ...). Correct me if I'm
>> > wrong, but Crunch started after Google Dataflow, especially because
>> > Dataflow was not open-sourced at that time.
>> >
>> > So, I agree it's very similar/close.
>> >
>> > Regards
>> > JB
>> >
>> >
>> > On 01/22/2016 05:51 PM, Ashish wrote:
>> >>
>> >> Hi JB,
>> >>
>> >> Curious to know about how it compares to Apache Crunch? Constructs
>> >> look very familiar (had used Crunch long ago)
>> >>
>> >> Thoughts?
>> >>
>> >> - Ashish
>> >>
>> >> On Fri, Jan 22, 2016 at 6:33 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> >> wrote:
>> >>>
>> >>> Hi Seshu,
>> >>>
>> >>> I blogged about the Apache Dataflow proposal:
>> >>> http://blog.nanthrax.net/2016/01/introducing-apache-dataflow/
>> >>>
>> >>> You can see in the "what's next ?"
>> >>> section that new runners, skins and sources are on our roadmap.
>> >>> Definitely, a Storm runner could be part of this.
>> >>>
>> >>> Regards
>> >>> JB
>> >>>
>> >>>
>> >>> On 01/22/2016 03:31 PM, Adunuthula, Seshu wrote:
>> >>>>
>> >>>> Awesome to see CloudDataFlow coming to Apache. The Stream Processing
>> >>>> area has been in general fragmented with a variety of solutions,
>> >>>> hoping the community galvanizes around Apache Data Flow.
>> >>>>
>> >>>> We are still in the "Apache Storm" world, any chance for folks
>> >>>> building a "Storm Runner"?
>> >>>>
>> >>>>
>> >>>> On 1/20/16, 9:39 AM, "James Malone" <jamesmal...@google.com.INVALID>
>> >>>> wrote:
>> >>>>
>> >>>>>> Great proposal. I like that your proposal includes a well presented
>> >>>>>> roadmap, but I don't see any goals that directly address building a
>> >>>>>> larger community. Y'all have any ideas around outreach that will
>> >>>>>> help with adoption?
>> >>>>>>
>> >>>>>
>> >>>>> Thank you and fair point. We have a few additional ideas which we can
>> >>>>> put into the Community section.
>> >>>>>
>> >>>>>>
>> >>>>>> As a start, I recommend y'all add a section to the proposal on the
>> >>>>>> wiki page for "Additional Interested Contributors" so that folks who
>> >>>>>> want to sign up to participate in the project can do so without
>> >>>>>> requesting additions to the initial committer list.
>> >>>>>>
>> >>>>>
>> >>>>> This is a great idea and I think it makes a lot of sense to add an
>> >>>>> "Additional Interested Contributors" section to the proposal.
>> >>>>>
>> >>>>>> On Wed, Jan 20, 2016 at 10:32 AM, James Malone <
>> >>>>>> jamesmal...@google.com.invalid> wrote:
>> >>>>>>
>> >>>>>>> Hello everyone,
>> >>>>>>>
>> >>>>>>> Attached to this message is a proposed new project - Apache
>> >>>>>>> Dataflow, a unified programming model for data processing and
>> >>>>>>> integration.
>> >>>>>>>
>> >>>>>>> The text of the proposal is included below. Additionally, the
>> >>>>>>> proposal is in draft form on the wiki where we will make any
>> >>>>>>> required changes:
>> >>>>>>>
>> >>>>>>> https://wiki.apache.org/incubator/DataflowProposal
>> >>>>>>>
>> >>>>>>> We look forward to your feedback and input.
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>>
>> >>>>>>> James
>> >>>>>>>
>> >>>>>>> ----
>> >>>>>>>
>> >>>>>>> = Apache Dataflow =
>> >>>>>>>
>> >>>>>>> == Abstract ==
>> >>>>>>>
>> >>>>>>> Dataflow is an open source, unified model and set of
>> >>>>>>> language-specific SDKs for defining and executing data processing
>> >>>>>>> workflows, and also data ingestion and integration flows,
>> >>>>>>> supporting Enterprise Integration Patterns (EIPs) and Domain
>> >>>>>>> Specific Languages (DSLs). Dataflow pipelines simplify the
>> >>>>>>> mechanics of large-scale batch and streaming data processing and
>> >>>>>>> can run on a number of runtimes like Apache Flink, Apache Spark,
>> >>>>>>> and Google Cloud Dataflow (a cloud service). Dataflow also brings
>> >>>>>>> DSLs in different languages, allowing users to easily implement
>> >>>>>>> their data integration processes.
>> >>>>>>>
>> >>>>>>> == Proposal ==
>> >>>>>>>
>> >>>>>>> Dataflow is a simple, flexible, and powerful system for distributed
>> >>>>>>> data processing at any scale. Dataflow provides a unified
>> >>>>>>> programming model, a software development kit to define and
>> >>>>>>> construct data processing pipelines, and runners to execute
>> >>>>>>> Dataflow pipelines in several runtime engines, like Apache Spark,
>> >>>>>>> Apache Flink, or Google Cloud Dataflow. Dataflow can be used for a
>> >>>>>>> variety of streaming or batch data processing goals including ETL,
>> >>>>>>> stream analysis, and aggregate computation. The underlying
>> >>>>>>> programming model for Dataflow provides MapReduce-like parallelism,
>> >>>>>>> combined with support for powerful data windowing, and fine-grained
>> >>>>>>> correctness control.
>> >>>>>>>
>> >>>>>>> == Background ==
>> >>>>>>>
>> >>>>>>> Dataflow started as a set of Google projects focused on making data
>> >>>>>>> processing easier, faster, and less costly. The Dataflow model is a
>> >>>>>>> successor to MapReduce, FlumeJava, and MillWheel inside Google and
>> >>>>>>> is focused on providing a unified solution for batch and stream
>> >>>>>>> processing. These projects on which Dataflow is based have been
>> >>>>>>> published in several papers made available to the public:
>> >>>>>>>
>> >>>>>>> * MapReduce - http://research.google.com/archive/mapreduce.html
>> >>>>>>>
>> >>>>>>> * Dataflow model - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
>> >>>>>>>
>> >>>>>>> * FlumeJava - http://notes.stephenholiday.com/FlumeJava.pdf
>> >>>>>>>
>> >>>>>>> * MillWheel - http://research.google.com/pubs/pub41378.html
>> >>>>>>>
>> >>>>>>> Dataflow was designed from the start to provide a portable
>> >>>>>>> programming layer. When you define a data processing pipeline with
>> >>>>>>> the Dataflow model, you are creating a job which is capable of
>> >>>>>>> being processed by any number of Dataflow processing engines.
>> >>>>>>> Several engines have been developed to run Dataflow pipelines in
>> >>>>>>> other open source runtimes, including Dataflow runners for Apache
>> >>>>>>> Flink and Apache Spark. There is also a "direct runner", for
>> >>>>>>> execution on the developer machine (mainly for dev/debug purposes).
>> >>>>>>> Another runner allows a Dataflow program to run on a managed
>> >>>>>>> service, Google Cloud Dataflow, in Google Cloud Platform. The
>> >>>>>>> Dataflow Java SDK is already available on GitHub, and independent
>> >>>>>>> from the Google Cloud Dataflow service. A Python SDK is also in
>> >>>>>>> active development.
>> >>>>>>>
>> >>>>>>> In this proposal, the Dataflow SDKs, model, and a set of runners
>> >>>>>>> will be submitted as an OSS project under the ASF. The runners
>> >>>>>>> which are a part of this proposal include those for Spark (from
>> >>>>>>> Cloudera), Flink (from data Artisans), and local development (from
>> >>>>>>> Google); the Google Cloud Dataflow service runner is not included
>> >>>>>>> in this proposal. Further references to Dataflow will refer to the
>> >>>>>>> Dataflow model, SDKs, and runners which are a part of this proposal
>> >>>>>>> (Apache Dataflow) only. The initial submission will contain the
>> >>>>>>> already-released Java SDK; Google intends to submit the Python SDK
>> >>>>>>> later in the incubation process. The Google Cloud Dataflow service
>> >>>>>>> will continue to be one of many runners for Dataflow, built on
>> >>>>>>> Google Cloud Platform, to run Dataflow pipelines. Necessarily,
>> >>>>>>> Cloud Dataflow will develop against the Apache project additions,
>> >>>>>>> updates, and changes. Google Cloud Dataflow will become one user of
>> >>>>>>> Apache Dataflow and will participate in the project openly and
>> >>>>>>> publicly.
>> >>>>>>>
>> >>>>>>> The Dataflow programming model has been designed with simplicity,
>> >>>>>>> scalability, and speed as key tenets.
>> >>>>>>> In the Dataflow model, you only need to think about four top-level
>> >>>>>>> concepts when constructing your data processing job:
>> >>>>>>>
>> >>>>>>> * Pipelines - The data processing job made of a series of
>> >>>>>>> computations including input, processing, and output
>> >>>>>>>
>> >>>>>>> * PCollections - Bounded (or unbounded) datasets which represent
>> >>>>>>> the input, intermediate and output data in pipelines
>> >>>>>>>
>> >>>>>>> * PTransforms - A data processing step in a pipeline in which one
>> >>>>>>> or more PCollections are an input and output
>> >>>>>>>
>> >>>>>>> * I/O Sources and Sinks - APIs for reading and writing data which
>> >>>>>>> are the roots and endpoints of the pipeline
>> >>>>>>>
>> >>>>>>> == Rationale ==
>> >>>>>>>
>> >>>>>>> With Dataflow, Google intended to develop a framework which allowed
>> >>>>>>> developers to be maximally productive in defining the processing,
>> >>>>>>> and then be able to execute the program at various levels of
>> >>>>>>> latency/cost/completeness without re-architecting or re-writing it.
>> >>>>>>> This goal was informed by Google's past experience developing
>> >>>>>>> several models, frameworks, and tools useful for large-scale and
>> >>>>>>> distributed data processing. While Google has previously published
>> >>>>>>> papers describing some of its technologies, Google decided to take
>> >>>>>>> a different approach with Dataflow.
>> >>>>>>> Google open-sourced the SDK and model alongside commercialization
>> >>>>>>> of the idea and ahead of publishing papers on the topic. As a
>> >>>>>>> result, a number of open source runtimes exist for Dataflow, such
>> >>>>>>> as the Apache Flink and Apache Spark runners.
>> >>>>>>>
>> >>>>>>> We believe that submitting Dataflow as an Apache project will
>> >>>>>>> provide an immediate, worthwhile, and substantial contribution to
>> >>>>>>> the open source community. As an incubating project, we believe
>> >>>>>>> Dataflow will have a better opportunity to provide a meaningful
>> >>>>>>> contribution to OSS and also integrate with other Apache projects.
>> >>>>>>>
>> >>>>>>> In the long term, we believe Dataflow can be a powerful abstraction
>> >>>>>>> layer for data processing. By providing an abstraction layer for
>> >>>>>>> data pipelines and processing, data workflows can be increasingly
>> >>>>>>> portable, resilient to breaking changes in tooling, and compatible
>> >>>>>>> across many execution engines, runtimes, and open source projects.
>> >>>>>>>
>> >>>>>>> == Initial Goals ==
>> >>>>>>>
>> >>>>>>> We are breaking our initial goals into immediate (< 2 months),
>> >>>>>>> short-term (2-4 months), and intermediate-term (> 4 months).
>> >>>>>>>
>> >>>>>>> Our immediate goals include the following:
>> >>>>>>>
>> >>>>>>> * Plan for reconciling the Dataflow Java SDK and various runners
>> >>>>>>> into one project
>> >>>>>>>
>> >>>>>>> * Plan for refactoring the existing Java SDK for better
>> >>>>>>> extensibility by SDK and runner writers
>> >>>>>>>
>> >>>>>>> * Validating all dependencies are ASL 2.0 or compatible
>> >>>>>>>
>> >>>>>>> * Understanding and adapting to the Apache development process
>> >>>>>>>
>> >>>>>>> Our short-term goals include:
>> >>>>>>>
>> >>>>>>> * Moving the newly-merged lists, and build utilities to Apache
>> >>>>>>>
>> >>>>>>> * Start refactoring codebase and move code to Apache Git repo
>> >>>>>>>
>> >>>>>>> * Continue development of new features, functions, and fixes in the
>> >>>>>>> Dataflow Java SDK, and Dataflow runners
>> >>>>>>>
>> >>>>>>> * Cleaning up the Dataflow SDK sources and crafting a roadmap and
>> >>>>>>> plan for how to include new major ideas, modules, and runtimes
>> >>>>>>>
>> >>>>>>> * Establishment of easy and clear build/test framework for Dataflow
>> >>>>>>> and associated runtimes; creation of testing, rollback, and
>> >>>>>>> validation policy
>> >>>>>>>
>> >>>>>>> * Analysis and design for work needed to make Dataflow a better
>> >>>>>>> data processing abstraction layer for multiple open source
>> >>>>>>> frameworks and environments
>> >>>>>>>
>> >>>>>>> Finally, we have a number of intermediate-term goals:
>> >>>>>>>
>> >>>>>>> * Roadmapping, planning, and execution of integrations with other
>> >>>>>>> OSS and non-OSS projects/products
>> >>>>>>>
>> >>>>>>> * Inclusion of additional SDK for Python, which is under
>> >>>>>>> active development
>> >>>>>>>
>> >>>>>>> == Current Status ==
>> >>>>>>>
>> >>>>>>> === Meritocracy ===
>> >>>>>>>
>> >>>>>>> Dataflow was initially developed based on ideas from many employees
>> >>>>>>> within Google. As an ASL OSS project on GitHub, the Dataflow SDK
>> >>>>>>> has received contributions from data Artisans, Cloudera Labs, and
>> >>>>>>> other individual developers. As a project under incubation, we are
>> >>>>>>> committed to expanding our effort to build an environment which
>> >>>>>>> supports a meritocracy. We are focused on engaging the community
>> >>>>>>> and other related projects for support and contributions. Moreover,
>> >>>>>>> we are committed to ensuring contributors and committers to
>> >>>>>>> Dataflow come from a broad mix of organizations through a
>> >>>>>>> merit-based decision process during incubation. We believe strongly
>> >>>>>>> in the Dataflow model and are committed to growing an inclusive
>> >>>>>>> community of Dataflow contributors.
>> >>>>>>>
>> >>>>>>> === Community ===
>> >>>>>>>
>> >>>>>>> The core of the Dataflow Java SDK has been developed by Google for
>> >>>>>>> use with Google Cloud Dataflow.
>> >>>>>>> Google has active community engagement in the SDK GitHub repository
>> >>>>>>> (https://github.com/GoogleCloudPlatform/DataflowJavaSDK), on Stack
>> >>>>>>> Overflow
>> >>>>>>> (http://stackoverflow.com/questions/tagged/google-cloud-dataflow)
>> >>>>>>> and has had contributions from a number of organizations and
>> >>>>>>> individuals.
>> >>>>>>>
>> >>>>>>> Every day, Cloud Dataflow is actively used by a number of
>> >>>>>>> organizations and institutions for batch and stream processing of
>> >>>>>>> data. We believe acceptance will allow us to consolidate existing
>> >>>>>>> Dataflow-related work, grow the Dataflow community, and deepen
>> >>>>>>> connections between Dataflow and other open source projects.
>> >>>>>>>
>> >>>>>>> === Core Developers ===
>> >>>>>>>
>> >>>>>>> The core developers for Dataflow and the Dataflow runners are:
>> >>>>>>>
>> >>>>>>> * Frances Perry
>> >>>>>>>
>> >>>>>>> * Tyler Akidau
>> >>>>>>>
>> >>>>>>> * Davor Bonaci
>> >>>>>>>
>> >>>>>>> * Luke Cwik
>> >>>>>>>
>> >>>>>>> * Ben Chambers
>> >>>>>>>
>> >>>>>>> * Kenn Knowles
>> >>>>>>>
>> >>>>>>> * Dan Halperin
>> >>>>>>>
>> >>>>>>> * Daniel Mills
>> >>>>>>>
>> >>>>>>> * Mark Shields
>> >>>>>>>
>> >>>>>>> * Craig Chambers
>> >>>>>>>
>> >>>>>>> * Maximilian Michels
>> >>>>>>>
>> >>>>>>> * Tom White
>> >>>>>>>
>> >>>>>>> * Josh Wills
>> >>>>>>>
>> >>>>>>> === Alignment ===
>> >>>>>>>
>> >>>>>>> The Dataflow SDK can be used to create Dataflow pipelines which can
>> >>>>>>> be executed on Apache Spark or Apache Flink.
>> >>>>>>> Dataflow is also related to other Apache projects, such as Apache
>> >>>>>>> Crunch. We plan on expanding functionality for Dataflow runners,
>> >>>>>>> support for additional domain specific languages, and increased
>> >>>>>>> portability so Dataflow is a powerful abstraction layer for data
>> >>>>>>> processing.
>> >>>>>>>
>> >>>>>>> == Known Risks ==
>> >>>>>>>
>> >>>>>>> === Orphaned Products ===
>> >>>>>>>
>> >>>>>>> The Dataflow SDK is presently used by several organizations, from
>> >>>>>>> small startups to Fortune 100 companies, to construct production
>> >>>>>>> pipelines which are executed in Google Cloud Dataflow. Google has a
>> >>>>>>> long-term commitment to advance the Dataflow SDK; moreover,
>> >>>>>>> Dataflow is seeing increasing interest, development, and adoption
>> >>>>>>> from organizations outside of Google.
>> >>>>>>>
>> >>>>>>> === Inexperience with Open Source ===
>> >>>>>>>
>> >>>>>>> Google believes strongly in open source and the exchange of
>> >>>>>>> information to advance new ideas and work. Examples of this
>> >>>>>>> commitment are active OSS projects such as Chromium
>> >>>>>>> (https://www.chromium.org) and Kubernetes (http://kubernetes.io/).
>> >>>>>>> With Dataflow, we have tried to be increasingly open and
>> >>>>>>> forward-looking; we have published a paper in the VLDB conference
>> >>>>>>> describing the Dataflow model
>> >>>>>>> (http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf) and were quick to
>> >>>>>>> release the Dataflow SDK as open source software with the launch of
>> >>>>>>> Cloud Dataflow. Our submission to the Apache Software Foundation is
>> >>>>>>> a logical extension of our commitment to open source software.
>> >>>>>>>
>> >>>>>>> === Homogeneous Developers ===
>> >>>>>>>
>> >>>>>>> The majority of committers in this proposal belong to Google due to
>> >>>>>>> the fact that Dataflow has emerged from several internal Google
>> >>>>>>> projects. This proposal also includes committers outside of Google
>> >>>>>>> who are actively involved with other Apache projects, such as
>> >>>>>>> Hadoop, Flink, and Spark. We expect our entry into incubation will
>> >>>>>>> allow us to expand the number of individuals and organizations
>> >>>>>>> participating in Dataflow development. Additionally, separation of
>> >>>>>>> the Dataflow SDK from Google Cloud Dataflow allows us to focus on
>> >>>>>>> the open source SDK and model and do what is best for this project.
>> >>>>>>>
>> >>>>>>> === Reliance on Salaried Developers ===
>> >>>>>>>
>> >>>>>>> The Dataflow SDK and Dataflow runners have been developed primarily
>> >>>>>>> by salaried developers supporting the Google Cloud Dataflow
>> >>>>>>> project. While the Dataflow SDK and Cloud Dataflow have been
>> >>>>>>> developed by different teams (and this proposal would reinforce
>> >>>>>>> that separation) we expect our initial set of developers will still
>> >>>>>>> primarily be salaried. Contribution has not been exclusively from
>> >>>>>>> salaried developers, however. For example, the contrib directory of
>> >>>>>>> the Dataflow SDK
>> >>>>>>> (https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/master/contrib)
>> >>>>>>> contains items from free-time contributors. Moreover, separate
>> >>>>>>> projects, such as ScalaFlow (https://github.com/darkjh/scalaflow)
>> >>>>>>> have been created around the Dataflow model and SDK. We expect our
>> >>>>>>> reliance on salaried developers will decrease over time during
>> >>>>>>> incubation.
>> >>>>>>>
>> >>>>>>> === Relationship with other Apache products ===
>> >>>>>>>
>> >>>>>>> Dataflow directly interoperates with or utilizes several existing
>> >>>>>>> Apache projects.
>> >>>>>>>
>> >>>>>>> * Build
>> >>>>>>>
>> >>>>>>> ** Apache Maven
>> >>>>>>>
>> >>>>>>> * Data I/O, Libraries
>> >>>>>>>
>> >>>>>>> ** Apache Avro
>> >>>>>>>
>> >>>>>>> ** Apache Commons
>> >>>>>>>
>> >>>>>>> * Dataflow runners
>> >>>>>>>
>> >>>>>>> ** Apache Flink
>> >>>>>>>
>> >>>>>>> ** Apache Spark
>> >>>>>>>
>> >>>>>>> Dataflow when used in batch mode shares similarities with Apache
>> >>>>>>> Crunch; however, Dataflow is focused on a model, SDK, and
>> >>>>>>> abstraction layer beyond Spark and Hadoop (MapReduce). One key goal
>> >>>>>>> of Dataflow is to provide an intermediate abstraction layer which
>> >>>>>>> can easily be implemented and utilized across several different
>> >>>>>>> processing frameworks.
>> >>>>>>>
>> >>>>>>> === An excessive fascination with the Apache brand ===
>> >>>>>>>
>> >>>>>>> With this proposal we are not seeking attention or publicity.
>> >>>>>>> Rather, we firmly believe in the Dataflow model, SDK, and the
>> >>>>>>> ability to make Dataflow a powerful yet simple framework for data
>> >>>>>>> processing. While the Dataflow SDK and model have been open source,
>> >>>>>>> we believe putting code on GitHub can only go so far. We see the
>> >>>>>>> Apache community, processes, and mission as critical for ensuring
>> >>>>>>> the Dataflow SDK and model are truly community-driven, positively
>> >>>>>>> impactful, and innovative open source software.
>> >>>>>>> While Google has taken a number of steps to advance its various
>> >>>>>>> open source projects, we believe Dataflow is a great fit for the
>> >>>>>>> Apache Software Foundation due to its focus on data processing and
>> >>>>>>> its relationships to existing ASF projects.
>> >>>>>>>
>> >>>>>>> == Documentation ==
>> >>>>>>>
>> >>>>>>> The following documentation is relevant to this proposal. Relevant
>> >>>>>>> portions of the documentation will be contributed to the Apache
>> >>>>>>> Dataflow project.
>> >>>>>>>
>> >>>>>>> * Dataflow website: https://cloud.google.com/dataflow
>> >>>>>>>
>> >>>>>>> * Dataflow programming model:
>> >>>>>>> https://cloud.google.com/dataflow/model/programming-model
>> >>>>>>>
>> >>>>>>> * Codebases
>> >>>>>>>
>> >>>>>>> ** Dataflow Java SDK:
>> >>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK
>> >>>>>>>
>> >>>>>>> ** Flink Dataflow runner:
>> >>>>>>> https://github.com/dataArtisans/flink-dataflow
>> >>>>>>>
>> >>>>>>> ** Spark Dataflow runner:
>> >>>>>>> https://github.com/cloudera/spark-dataflow
>> >>>>>>>
>> >>>>>>> * Dataflow Java SDK issue tracker:
>> >>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues
>> >>>>>>>
>> >>>>>>> * google-cloud-dataflow tag on Stack Overflow:
>> >>>>>>> http://stackoverflow.com/questions/tagged/google-cloud-dataflow
>> >>>>>>>
>> >>>>>>> == Initial Source ==
>> >>>>>>>
>> >>>>>>> The initial source for Dataflow which we will submit to the Apache
>> >>>>>>> Foundation will include several related projects which are
>> >>>>>>> currently hosted in the following GitHub repositories:
>> >>>>>>>
>> >>>>>>> * Dataflow Java SDK
>> >>>>>>> (https://github.com/GoogleCloudPlatform/DataflowJavaSDK)
>> >>>>>>>
>> >>>>>>> * Flink Dataflow runner
>> >>>>>>> (https://github.com/dataArtisans/flink-dataflow)
>> >>>>>>>
>> >>>>>>> * Spark Dataflow runner
>> >>>>>>> (https://github.com/cloudera/spark-dataflow)
>> >>>>>>>
>> >>>>>>> These projects have always been Apache 2.0 licensed. We intend to
>> >>>>>>> bundle all of these repositories since they are all complementary
>> >>>>>>> and should be maintained in one project. Prior to our submission,
>> >>>>>>> we will combine all of these projects into a new git repository.
>> >>>>>>>
>> >>>>>>> == Source and Intellectual Property Submission Plan ==
>> >>>>>>>
>> >>>>>>> The source for the Dataflow SDK and the three runners (Spark,
>> >>>>>>> Flink, Google Cloud Dataflow) is already licensed under an Apache 2
>> >>>>>>> license.
>> >>>>>>>
>> >>>>>>> * Dataflow SDK -
>> >>>>>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/LICENSE
>> >>>>>>>
>> >>>>>>> * Flink runner -
>> >>>>>>> https://github.com/dataArtisans/flink-dataflow/blob/master/LICENSE
>> >>>>>>>
>> >>>>>>> * Spark runner -
>> >>>>>>> https://github.com/cloudera/spark-dataflow/blob/master/LICENSE
>> >>>>>>>
>> >>>>>>> Contributors to the Dataflow SDK have also signed the Google
>> >>>>>>> Individual Contributor License Agreement
>> >>>>>>> (https://cla.developers.google.com/about/google-individual) in
>> >>>>>>> order to contribute to the project.
>> >>>>>>>
>> >>>>>>> With respect to trademark rights, Google does not hold a trademark
>> >>>>>>> on the phrase "Dataflow." Based on feedback and guidance we receive
>> >>>>>>> during the incubation process, we are open to renaming the project
>> >>>>>>> if necessary for trademark or other concerns.
>> >>>>>>>
>> >>>>>>> == External Dependencies ==
>> >>>>>>>
>> >>>>>>> All external dependencies are licensed under an Apache 2.0 or
>> >>>>>>> Apache-compatible license. As we grow the Dataflow community we
>> >>>>>>> will configure our build process to require and validate all
>> >>>>>>> contributions and dependencies are licensed under the Apache 2.0
>> >>>>>>> license or are under an Apache-compatible license.
>> >>>>>>>
>> >>>>>>> == Required Resources ==
>> >>>>>>>
>> >>>>>>> === Mailing Lists ===
>> >>>>>>>
>> >>>>>>> We currently use a mix of mailing lists. We will migrate our
>> >>>>>>> existing mailing lists to the following:
>> >>>>>>>
>> >>>>>>> * d...@dataflow.incubator.apache.org
>> >>>>>>>
>> >>>>>>> * u...@dataflow.incubator.apache.org
>> >>>>>>>
>> >>>>>>> * priv...@dataflow.incubator.apache.org
>> >>>>>>>
>> >>>>>>> * comm...@dataflow.incubator.apache.org
>> >>>>>>>
>> >>>>>>> === Source Control ===
>> >>>>>>>
>> >>>>>>> The Dataflow team currently uses Git and would like to continue to
>> >>>>>>> do so. We request a Git repository for Dataflow with mirroring to
>> >>>>>>> GitHub enabled.
>> >>>>>>>
>> >>>>>>> === Issue Tracking ===
>> >>>>>>>
>> >>>>>>> We request the creation of an Apache-hosted JIRA.
The Dataflow project is currently using both a public GitHub issue
>> >>>>>>> tracker and internal Google issue tracking. We will migrate and
>> >>>>>>> combine from these two sources to the Apache JIRA.
>> >>>>>>>
>> >>>>>>> == Initial Committers ==
>> >>>>>>>
>> >>>>>>> * Aljoscha Krettek [aljos...@apache.org]
>> >>>>>>> * Amit Sela [amitsel...@gmail.com]
>> >>>>>>> * Ben Chambers [bchamb...@google.com]
>> >>>>>>> * Craig Chambers [chamb...@google.com]
>> >>>>>>> * Dan Halperin [dhalp...@google.com]
>> >>>>>>> * Davor Bonaci [da...@google.com]
>> >>>>>>> * Frances Perry [f...@google.com]
>> >>>>>>> * James Malone [jamesmal...@google.com]
>> >>>>>>> * Jean-Baptiste Onofré [jbono...@apache.org]
>> >>>>>>> * Josh Wills [jwi...@apache.org]
>> >>>>>>> * Kostas Tzoumas [kos...@data-artisans.com]
>> >>>>>>> * Kenneth Knowles [k...@google.com]
>> >>>>>>> * Luke Cwik [lc...@google.com]
>> >>>>>>> * Maximilian Michels [m...@apache.org]
>> >>>>>>> * Stephan Ewen [step...@data-artisans.com]
>> >>>>>>> * Tom White [t...@cloudera.com]
>> >>>>>>> * Tyler Akidau [taki...@google.com]
>> >>>>>>>
>> >>>>>>> == Affiliations ==
>> >>>>>>>
>> >>>>>>> The initial committers are from six organizations. Google developed
>> >>>>>>> Dataflow and the Dataflow SDK, data Artisans developed the Flink
>> >>>>>>> runner, and Cloudera (Labs) developed the Spark runner.
>> >>>>>>>
>> >>>>>>> * Cloudera
>> >>>>>>> ** Tom White
>> >>>>>>> * Data Artisans
>> >>>>>>> ** Aljoscha Krettek
>> >>>>>>> ** Kostas Tzoumas
>> >>>>>>> ** Maximilian Michels
>> >>>>>>> ** Stephan Ewen
>> >>>>>>> * Google
>> >>>>>>> ** Ben Chambers
>> >>>>>>> ** Dan Halperin
>> >>>>>>> ** Davor Bonaci
>> >>>>>>> ** Frances Perry
>> >>>>>>> ** James Malone
>> >>>>>>> ** Kenneth Knowles
>> >>>>>>> ** Luke Cwik
>> >>>>>>> ** Tyler Akidau
>> >>>>>>> * PayPal
>> >>>>>>> ** Amit Sela
>> >>>>>>> * Slack
>> >>>>>>> ** Josh Wills
>> >>>>>>> * Talend
>> >>>>>>> ** Jean-Baptiste Onofré
>> >>>>>>>
>> >>>>>>> == Sponsors ==
>> >>>>>>>
>> >>>>>>> === Champion ===
>> >>>>>>>
>> >>>>>>> * Jean-Baptiste Onofre [jbono...@apache.org]
>> >>>>>>>
>> >>>>>>> === Nominated Mentors ===
>> >>>>>>>
>> >>>>>>> * Jim Jagielski [j...@apache.org]
>> >>>>>>> * Venkatesh Seetharam [venkat...@apache.org]
>> >>>>>>> * Bertrand Delacretaz [bdelacre...@apache.org]
>> >>>>>>> * Ted Dunning [tdunn...@apache.org]
>> >>>>>>>
>> >>>>>>> === Sponsoring Entity ===
>> >>>>>>>
>> >>>>>>> The Apache Incubator
>> >>>>>>>
>> >>>>>> --
>> >>>>>> Sean
>> >>>>>>
>> >>>> ---------------------------------------------------------------------
>> >>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> >>>> For additional commands, e-mail: general-h...@incubator.apache.org
>> >>>>
>> >>> --
>> >>> Jean-Baptiste Onofré
>> >>> jbono...@apache.org
>> >>> http://blog.nanthrax.net
>> >>> Talend - http://www.talend.com
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> >>> For additional
commands, e-mail: general-h...@incubator.apache.org

--
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal