[48/50] [abbrv] incubator-beam git commit: Update README for initial code drop.

jamesmalone Fri, 26 Feb 2016 14:55:09 -0800

Update README for initial code drop.


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/3623a237
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/3623a237
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/3623a237

Branch: refs/heads/master
Commit: 3623a237fb1d40ace9d2a06690f89cb3ff3dbb20
Parents: 394390f
Author: Frances Perry <[email protected]>
Authored: Fri Feb 26 12:22:15 2016 -0800
Committer: Frances Perry <[email protected]>
Committed: Fri Feb 26 12:22:15 2016 -0800

----------------------------------------------------------------------
 README.md | 138 ++++++++++++++++++---------------------------------------
 1 file changed, 43 insertions(+), 95 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/3623a237/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index d5345a3..db4a13f 100644
--- a/README.md
+++ b/README.md
@@ -1,125 +1,73 @@
-# Google Cloud Dataflow SDK for Java
+# Apache Beam
 
-[Google Cloud Dataflow](https://cloud.google.com/dataflow/) provides a simple,
-powerful programming model for building both batch and streaming parallel data
-processing pipelines. This repository hosts the open-sourced Cloud Dataflow SDK
-for Java, which can be used to run pipelines against the Google Cloud Dataflow
-Service.
+[Apache Beam](http://beam.incubator.apache.org) is a unified model for 
defining both batch and streaming data-parallel processing pipelines, as well 
as a set of language-specific SDKs for constructing pipelines and Runners for 
executing them on distributed processing backends like [Apache 
Spark](http://spark.apache.org/), [Apache Flink](http://flink.apache.org), and 
[Google Cloud Dataflow](http://cloud.google.com/dataflow).
 
-[General usage](https://cloud.google.com/dataflow/getting-started) of Google
-Cloud Dataflow does **not** require use of this repository. Instead:
 
-1. depend directly on a specific
-[version](https://cloud.google.com/dataflow/release-notes/java) of the SDK in
-the [Maven Central 
Repository](http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22com.google.cloud.dataflow%22)
-by adding the following dependency to development
-environments like Eclipse or Apache Maven:
+## Status 
 
-        <dependency>
-          <groupId>com.google.cloud.dataflow</groupId>
-          <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
-          <version>version_number</version>
-        </dependency>
+_**The Apache Beam project is in the process of bootstrapping. This includes 
the creation of project resources, the refactoring of the initial code 
submissions, and the formulation of project documentation, planning, and design 
documents. Please expect a significant amount of churn and breaking changes in 
the near future.**_
 
-1. download the example pipelines from the separate
-[DataflowJavaSDK-examples](https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples)
-repository.
+[Build Status](http://builds.apache.org/job/beam-master)
 
-However, if you'd like to contribute to the SDK, write your own PipelineRunner,
-or just dig in for the fun of it, please stay with us here!
 
-## Status [![Build 
Status](https://travis-ci.org/GoogleCloudPlatform/DataflowJavaSDK.svg?branch=master)](https://travis-ci.org/GoogleCloudPlatform/DataflowJavaSDK)
+## Overview
 
-Both the SDK and the Dataflow Service are generally available, open to all
-developers, and considered stable and fully qualified for production use.
+Beam provides a general approach to expressing [embarrassingly 
parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) data 
processing pipelines and supports three categories of users, each of which has 
relatively disparate backgrounds and needs.
 
-## Overview
+1. _End Users_: Writing pipelines with an existing SDK, running them on an 
existing runner. These users want to focus on writing their application logic 
and have everything else just work.
+2. _SDK Writers_: Developing a Beam SDK targeted at a specific user community 
(Java, Python, Scala, Go, R, graphical, etc.). These users are language geeks, 
and would prefer to be shielded from all the details of various runners and 
their implementations.
+3. _Runner Writers_: Have an execution environment for distributed processing 
and would like to support programs written against the Beam Model. Would prefer 
to be shielded from details of multiple SDKs.
 
-The key concepts in this programming model are:
-
-* 
[`PCollection`](https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/values/PCollection.java):
-represents a collection of data, which could be bounded or unbounded in size.
-* 
[`PTransform`](https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/transforms/PTransform.java):
-represents a computation that transforms input PCollections into output
-PCollections.
-* 
[`Pipeline`](https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/Pipeline.java):
-manages a directed acyclic graph of PTransforms and PCollections that is ready
-for execution.
-* 
[`PipelineRunner`](https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/PipelineRunner.java):
-specifies where and how the pipeline should execute.
-
-We provide three PipelineRunners:
-
-  1. The 
[`DirectPipelineRunner`](https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/DirectPipelineRunner.java)
-runs the pipeline on your local machine.
-  2. The 
[`DataflowPipelineRunner`](https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/DataflowPipelineRunner.java)
-submits the pipeline to the Dataflow Service, where it runs using managed
-resources in the [Google Cloud Platform](https://cloud.google.com) (GCP).
-  3. The 
[`BlockingDataflowPipelineRunner`](https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/BlockingDataflowPipelineRunner.java)
-submits the pipeline to the Dataflow Service via the `DataflowPipelineRunner`
-and then prints messages about the job status until the execution is complete.
-
-The SDK is built to be extensible and support additional execution environments
-beyond local execution and the Google Cloud Dataflow Service. In partnership
-with [Cloudera](https://www.cloudera.com/), you can run Dataflow pipelines on
-an [Apache Spark](https://spark.apache.org/) backend using the
-[`SparkPipelineRunner`](https://github.com/cloudera/spark-dataflow).
-Additionally, you can run Dataflow pipelines on an
-[Apache Flink](https://flink.apache.org/) backend using the
-[`FlinkPipelineRunner`](https://github.com/dataArtisans/flink-dataflow).
 
-## Getting Started
+### The Beam Model
+
+The model behind Beam evolved from a number of internal Google data processing 
projects, including 
[MapReduce](http://research.google.com/archive/mapreduce.html), 
[FlumeJava](http://research.google.com/pubs/pub35650.html), and 
[Millwheel](http://research.google.com/pubs/pub41378.html). This model was 
originally known as the “[Dataflow 
Model](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf)”. 
+
+To learn more about the Beam Model (though still under the original name of 
Dataflow), see the World Beyond Batch: [Streaming 
101](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101) and [Streaming 
102](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102) posts 
on O’Reilly’s Radar site, and the [VLDB 2015 
paper](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf).
 
-This repository consists of the following parts:
+The key concepts in the Beam programming model are:
 
-* The 
[`sdk`](https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk)
-module provides a set of basic Java APIs to program against.
-* The 
[`examples`](https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples)
-module provides a few samples to get started. We recommend starting with the
-`WordCount` example.
-* The 
[`contrib`](https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/contrib)
-directory hosts community-contributed Dataflow modules.
+* `PCollection`: represents a collection of data, which could be bounded or 
unbounded in size.
+* `PTransform`: represents a computation that transforms input PCollections 
into output PCollections.
+* `Pipeline`: manages a directed acyclic graph of PTransforms and PCollections 
that is ready for execution.
+* `PipelineRunner`: specifies where and how the pipeline should execute.
 
-The following command will build both the `sdk` and `example` modules and
-install them in your local Maven repository:
 
-    mvn clean install
+### SDKs
 
-You can speed up the build and install process by using the following options:
+Beam supports multiple language-specific SDKs for writing pipelines against 
the Beam Model. 
 
-  1. To skip execution of the unit tests, run:
+Currently, this repository contains the Beam Java SDK, which is in the process 
of evolving from the [Dataflow Java 
SDK](https://github.com/GoogleCloudPlatform/DataflowJavaSDK). The [Dataflow 
Python SDK](https://github.com/GoogleCloudPlatform/DataflowPythonSDK) will also 
become part of Beam in the near future.
 
-        mvn install -DskipTests
+Have ideas for new SDKs or DSLs? See the 
[Jira](https://issues.apache.org/jira/browse/BEAM/component/12328909/).
 
-  2. While iterating on a specific module, use the following command to compile
-  and reinstall it. For example, to reinstall the `examples` module, run:
 
-        mvn install -pl examples
+### Runners
 
-  Be careful, however, as this command will use the most recently installed SDK
-  from the local repository (or Maven Central) even if you have changed it
-  locally.
+Beam supports executing programs on multiple distributed processing backends. 
After the Beam project's initial bootstrapping completes, it will include:
+  1. The `DirectPipelineRunner` runs the pipeline on your local machine.
+  2. The `DataflowPipelineRunner` submits the pipeline to the [Google Cloud 
Dataflow](http://cloud.google.com/dataflow/) service.
+  3. The `SparkPipelineRunner` runs the pipeline on an Apache Spark cluster. 
See the code that will be donated at 
[cloudera/spark-dataflow](https://github.com/cloudera/spark-dataflow).
+  4. The `FlinkPipelineRunner` runs the pipeline on an Apache Flink cluster. 
See the code that will be donated at 
[dataArtisans/flink-dataflow](https://github.com/dataArtisans/flink-dataflow).
 
-If you are using [Eclipse](https://eclipse.org/) integrated development
-environment (IDE), the
-[Cloud Dataflow Plugin for 
Eclipse](https://cloud.google.com/dataflow/getting-started-eclipse)
-provides tools to create and execute Dataflow pipelines locally and on the
-Dataflow Service.
+Have ideas for new Runners? See the 
[Jira](https://issues.apache.org/jira/browse/BEAM/component/12328916/).
+
+
+## Getting Started
+
+_Coming soon!_
 
-After building and installing, you can execute the `WordCount` and other
-example pipelines by following the instructions in this
-[README](https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/README.md).
 
 ## Contact Us
 
-We welcome all usage-related questions on [Stack 
Overflow](http://stackoverflow.com/questions/tagged/google-cloud-dataflow)
-tagged with `google-cloud-dataflow`.
+To get involved in Apache Beam:
+
+* [Subscribe](mailto:[email protected]) or 
[mail](mailto:[email protected]) the 
[[email protected]](http://mail-archives.apache.org/mod_mbox/incubator-beam-user/)
 list.
+* [Subscribe](mailto:[email protected]) or 
[mail](mailto:[email protected]) the 
[[email protected]](http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/)
 list.
+* Report issues on [Jira](https://issues.apache.org/jira/browse/BEAM).
 
-Please use [issue 
tracker](https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues)
-on GitHub to report any bugs, comments or questions regarding SDK development.
 
 ## More Information
 
-* [Google Cloud Dataflow](https://cloud.google.com/dataflow/)
-* [Dataflow Concepts and Programming 
Model](https://cloud.google.com/dataflow/model/programming-model)
-* [Java API 
Reference](https://cloud.google.com/dataflow/java-sdk/JavaDoc/index)
+* [Apache Beam](http://beam.incubator.apache.org)
+* [Apache Beam Documentation](http://beam.incubator.apache.org/documentation)
