Repository: incubator-beam

Updated Branches:
  refs/heads/master ef1e32dee -> 659f0b877
[BEAM-113] Update Spark runner README

Just a couple more changes

Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/b0db3131
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/b0db3131
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/b0db3131

Branch: refs/heads/master
Commit: b0db313199fcb01eec04457c2a04103b7a218a1a
Parents: ef1e32d
Author: Sela <[email protected]>
Authored: Wed Mar 16 22:22:35 2016 +0200
Committer: Sela <[email protected]>
Committed: Thu Mar 17 18:54:45 2016 +0200

----------------------------------------------------------------------
 runners/spark/README.md | 112 ++++++++++++++++++++++++-------------------
 1 file changed, 63 insertions(+), 49 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/b0db3131/runners/spark/README.md
----------------------------------------------------------------------
diff --git a/runners/spark/README.md b/runners/spark/README.md
index ccf8516..1d75b35 100644
--- a/runners/spark/README.md
+++ b/runners/spark/README.md
@@ -1,55 +1,78 @@
-spark-dataflow
-==============
+Spark Beam Runner (Spark-Runner)
+================================
 
 ## Intro
 
-Spark-dataflow allows users to execute data pipelines written against the Google Cloud Dataflow API
-with Apache Spark. Spark-dataflow is an early prototype, and we'll be working on it continuously.
-If this project interests you, we welcome issues, comments, and (especially!) pull requests.
-To get an idea of what we have already identified as
-areas that need improvement, checkout the issues listed in the github repo.
+The Spark-Runner allows users to execute data pipelines written against the Apache Beam API
+with Apache Spark. This runner allows the execution of both batch and streaming pipelines on top of the Spark engine.
 
-## Motivation
+## Overview
 
-We had two primary goals when we started working on Spark-dataflow:
+### Features
 
-1. *Provide portability for data pipelines written for Google Cloud Dataflow.* Google makes
-it really easy to get started writing pipelines against the Dataflow API, but they wanted
-to be sure that creating a pipeline using their tools would not lock developers in to their
-platform. A Spark-based implementation of Dataflow means that you can take your pipeline
-logic with you wherever you go. This also means that any new machine learning and anomaly
-detection algorithms that are developed against the Dataflow API are available to everyone,
-regardless of their underlying execution platform.
+- ParDo
+- GroupByKey
+- Combine
+- Windowing
+- Flatten
+- View
+- Side inputs/outputs
+- Encoding
+
+### Sources and Sinks
+
+- Text
+- Hadoop
+- Avro
+- Kafka
+
+### Fault-Tolerance
+
+The Spark runner provides the same fault-tolerance guarantees as [Apache Spark](http://spark.apache.org/).
+
+### Monitoring
+
+The Spark runner supports monitoring via Beam Aggregators implemented on top of Spark's [Accumulators](http://spark.apache.org/docs/latest/programming-guide.html#accumulators).
+Spark also provides a web UI for monitoring; more details [here](http://spark.apache.org/docs/latest/monitoring.html).
+
+## Beam Model support
+
+### Batch
+
+The Spark runner provides support for batch processing of Beam bounded PCollections as Spark [RDD](http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds)s.
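A minimal sketch of such a batch pipeline (an editorial illustration, not part of this commit): `SparkPipelineRunner` and `EvaluationResult` appear elsewhere in this README, while `Pipeline`, `PipelineOptionsFactory`, `TextIO`, and `Count` are assumed from the Dataflow-derived SDK of this period, so exact packages and signatures may differ:

    // Sketch only: a bounded pipeline that the runner translates into Spark
    // RDD operations. Class names are assumed from the Dataflow-derived SDK.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    p.apply(TextIO.Read.from("/tmp/kinglear.txt"))   // bounded PCollection<String>
     .apply(Count.<String>perElement());             // runs as RDD transformations
    EvaluationResult result = SparkPipelineRunner.create().run(p);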
+
+### Streaming
+
+The Spark runner currently provides partial support for stream processing of Beam unbounded PCollections as Spark [DStream](http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams)s.
+The current implementation of *Windowing* takes the first window size in the pipeline and treats it as the Spark "batch interval", while following windows will be treated as *Processing Time* windows.
+Further work is required to provide full support for the Beam Model's *event-time* and *out-of-order* stream processing.
+
+### Issue Tracking
+
+See [Beam JIRA](https://issues.apache.org/jira/browse/BEAM) (runner-spark).
 
-2. *Experiment with new data pipeline design patterns.* The Dataflow API has a number of
-interesting ideas, especially with respect to the unification of batch and stream data
-processing into a single API that maps into two separate engines. The Dataflow streaming
-engine, based on Google's [Millwheel](http://research.google.com/pubs/pub41378.html), does
-not have a direct open source analogue, and we wanted to understand how to replicate its
-functionality using frameworks like Spark Streaming.
 
 ## Getting Started
 
-The Maven coordinates of the current version of this project are:
+To get the latest version of the Spark Runner, first clone the Beam repository:
+
+    git clone https://github.com/apache/incubator-beam
 
-    <groupId>com.cloudera.dataflow.spark</groupId>
-    <artifactId>spark-dataflow</artifactId>
-    <version>0.4.2</version>
-and are hosted in Cloudera's repository at:
+Then switch to the newly created directory and run Maven to build Apache Beam:
+
+    cd incubator-beam
+    mvn clean install -DskipTests
 
-    <repository>
-      <id>cloudera.repo</id>
-      <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
-    </repository>
+Now Apache Beam and the Spark Runner are installed in your local Maven repository.
 
-If we wanted to run a dataflow pipeline with the default options of a single threaded spark
+If we wanted to run a Beam pipeline with the default options of a single threaded Spark
 instance in local mode, we would do the following:
 
     Pipeline p = <logic for pipeline creation >
     EvaluationResult result = SparkPipelineRunner.create().run(p);
 
-To create a pipeline runner to run against a different spark cluster, with a custom master url we
+To create a pipeline runner to run against a different Spark cluster, with a custom master URL we
 would do the following:
 
     Pipeline p = <logic for pipeline creation >
@@ -62,7 +85,11 @@ would do the following:
 First download a text document to use as input:
 
     curl http://www.gutenberg.org/cache/epub/1128/pg1128.txt > /tmp/kinglear.txt
+
+Switch to the Spark runner directory:
+
+    cd runners/spark
 
 Then run the [word count example][wc] from the SDK using a single threaded Spark
 instance in local mode:
@@ -77,11 +104,10 @@ Check the output by running:
 __Note: running examples using `mvn exec:exec` only works for Spark local mode at the
 moment. See the next section for how to run on a cluster.__
 
-[wc]: https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/WordCount.java
-
+[wc]: https://github.com/apache/incubator-beam/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/WordCount.java
 
 ## Running on a Cluster
 
-Spark Dataflow pipelines can be run on a cluster using the `spark-submit` command.
+Spark Beam pipelines can be run on a cluster using the `spark-submit` command.
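One hedged aside before the cluster walkthrough: the `--sparkMaster` flag used in the example below corresponds to the runner's pipeline options in code. A sketch of the in-code equivalent, assuming the `SparkPipelineOptionsFactory` and `setSparkMaster` names from the pre-donation spark-dataflow API (they may have been renamed during incubation):

    // Sketch only: pointing the runner at a specific cluster master, in the
    // spirit of the "custom master URL" example above. The options factory
    // and setter names are assumptions, not confirmed by this commit.
    SparkPipelineOptions options = SparkPipelineOptionsFactory.create();
    options.setSparkMaster("spark://host:7077");
    EvaluationResult result = SparkPipelineRunner.create(options).run(p);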
 First copy a text document to HDFS:
 
@@ -93,21 +119,9 @@ Then run the word count example using Spark submit with the `yarn-client` master
 
     spark-submit \
     --class com.google.cloud.dataflow.examples.WordCount \
     --master yarn-client \
-    target/spark-dataflow-*-spark-app.jar \
+    target/spark-runner-*-spark-app.jar \
     --inputFile=kinglear.txt --output=out --runner=SparkPipelineRunner --sparkMaster=yarn-client
 
 Check the output by running:
 
     hadoop fs -tail out-00000-of-00002
-
-## How to Release
-
-Committers can release the project using the standard [Maven Release Plugin](http://maven.apache.org/maven-release/maven-release-plugin/) commands:
-
-    mvn release:prepare
-    mvn release:perform -Darguments="-Dgpg.passphrase=XXX"
-
-Note that you will need a [public GPG key](http://www.apache.org/dev/openpgp.html).
-
-[](https://travis-ci.org/cloudera/spark-dataflow)
-[](https://codecov.io/github/cloudera/spark-dataflow?branch=master)
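As a closing illustration of the Streaming section above (the first window size in the pipeline becomes Spark's batch interval), a sketch using the Dataflow-derived SDK's `Window`/`FixedWindows` classes; the unbounded source is a placeholder in the README's own pseudocode style, and joda-time `Duration` is assumed:

    // Sketch only: the first fixed-window size (10 seconds here) is what the
    // runner treats as Spark's batch interval; per the Streaming section,
    // subsequent windows fall back to processing-time semantics.
    Pipeline p = <logic creating a pipeline from an unbounded source, e.g. Kafka>
    p.apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(10))));
    EvaluationResult result = SparkPipelineRunner.create(options).run(p);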