This is an automated email from the ASF dual-hosted git repository. mergebot-role pushed a commit to branch mergebot in repository https://gitbox.apache.org/repos/asf/beam-site.git
commit e2ab466cb8f9d2b1db338eaa08c2d4d16469523a Author: Pei He <p...@apache.org> AuthorDate: Fri Sep 8 15:18:11 2017 +0800 [BEAM-2839, BEAM-2838] Add MapReduce runner to Beam asf-site. --- src/_data/capability-matrix.yml | 114 +++++++++++++++++++++++++++++++++ src/documentation/runners/mapreduce.md | 80 +++++++++++++++++++++++ src/get-started/beam-overview.md | 2 + src/images/logos/runners/mapreduce.png | Bin 0 -> 37095 bytes 4 files changed, 196 insertions(+) diff --git a/src/_data/capability-matrix.yml b/src/_data/capability-matrix.yml index 775e0da..c4bbb3b 100644 --- a/src/_data/capability-matrix.yml +++ b/src/_data/capability-matrix.yml @@ -11,6 +11,8 @@ columns: name: Apache Apex - class: gearpump name: Apache Gearpump + - class: mapreduce + name: MapReduce categories: - description: What is being computed? @@ -46,6 +48,10 @@ categories: l1: 'Yes' l2: fully supported l3: Gearpump wraps the per-element transformation function into processor execution. + - class: mapreduce + l1: 'Yes' + l2: fully supported + l3: '' - name: GroupByKey values: - class: model @@ -72,6 +78,10 @@ categories: l1: 'Yes' l2: fully supported l3: "Use Gearpump's groupBy and window for key grouping and translate Beam's windowing and triggering to Gearpump's internal implementation." 
+ - class: mapreduce + l1: 'Yes' + l2: fully supported + l3: '' - name: Flatten values: - class: model @@ -98,6 +108,10 @@ categories: l1: 'Yes' l2: fully supported l3: '' + - class: mapreduce + l1: 'Yes' + l2: fully supported + l3: '' - name: Combine values: - class: model @@ -124,6 +138,10 @@ categories: l1: 'Yes' l2: fully supported l3: '' + - class: mapreduce + l1: 'Yes' + l2: fully supported + l3: '' - name: Composite Transforms values: - class: model @@ -150,6 +168,10 @@ categories: l1: 'Partially' l2: supported via inlining l3: '' + - class: mapreduce + l1: 'Yes' + l2: fully supported + l3: '' - name: Side Inputs values: - class: model @@ -176,6 +198,10 @@ categories: l1: 'Yes' l2: fully supported l3: Implemented by merging side input as a normal stream in Gearpump + - class: mapreduce + l1: 'Yes' + l2: fully supported + l3: '' - name: Source API values: - class: model @@ -202,6 +228,10 @@ categories: l1: 'Yes' l2: fully supported l3: '' + - class: mapreduce + l1: 'Partially' + l2: bounded source only + l3: '' - name: Splittable DoFn values: - class: model @@ -228,6 +258,10 @@ categories: l1: 'No' l2: not implemented l3: + - class: mapreduce + l1: 'No' + l2: not implemented + l3: - name: Metrics values: - class: model @@ -254,6 +288,10 @@ categories: l1: 'No' l2: '' l3: not implemented + - class: mapreduce + l1: 'Partially' + l2: Only attempted counters are supported + l3: '' - name: Stateful Processing values: - class: model @@ -280,6 +318,10 @@ categories: l1: 'No' l2: not implemented l3: '' + - class: mapreduce + l1: 'Partially' + l2: non-merging windows + l3: '' - description: Where in event time? 
anchor: where color-b: '37d' @@ -313,6 +355,10 @@ categories: l1: 'Yes' l2: supported l3: '' + - class: mapreduce + l1: 'Yes' + l2: supported + l3: '' - name: Fixed windows values: - class: model @@ -339,6 +385,10 @@ categories: l1: 'Yes' l2: supported l3: '' + - class: mapreduce + l1: 'Yes' + l2: supported + l3: '' - name: Sliding windows values: - class: model @@ -365,6 +415,10 @@ categories: l1: 'Yes' l2: supported l3: '' + - class: mapreduce + l1: 'Yes' + l2: supported + l3: '' - name: Session windows values: - class: model @@ -391,6 +445,10 @@ categories: l1: 'Yes' l2: supported l3: '' + - class: mapreduce + l1: 'Yes' + l2: supported + l3: '' - name: Custom windows values: - class: model @@ -417,6 +475,10 @@ categories: l1: 'Yes' l2: supported l3: '' + - class: mapreduce + l1: 'Yes' + l2: supported + l3: '' - name: Custom merging windows values: - class: model @@ -443,6 +505,10 @@ categories: l1: 'Yes' l2: supported l3: '' + - class: mapreduce + l1: 'Yes' + l2: supported + l3: '' - name: Timestamp control values: - class: model @@ -469,6 +535,10 @@ categories: l1: 'Yes' l2: supported l3: '' + - class: mapreduce + l1: 'Yes' + l2: supported + l3: '' - description: When in processing time? @@ -505,6 +575,10 @@ categories: l1: 'No' l2: '' l3: '' + - class: mapreduce + l1: 'Yes' + l2: Intermediate trigger firings are effectively meaningless. + l3: '' - name: Event-time triggers values: @@ -532,6 +606,10 @@ categories: l1: 'Yes' l2: fully supported l3: '' + - class: mapreduce + l1: 'Yes' + l2: Currently watermark progress jumps from the beginning of time to the end of time once the input has been fully consumed, thus no additional triggering granularity is available. 
+ l3: '' - name: Processing-time triggers values: @@ -559,6 +637,10 @@ categories: l1: 'No' l2: '' l3: '' + - class: mapreduce + l1: 'Yes' + l2: From the perspective of triggers, processing time currently jumps from the beginning of time to the end of time once the input has been fully consumed, thus no additional triggering granularity is available. + l3: '' - name: Count triggers values: @@ -586,6 +668,10 @@ categories: l1: 'No' l2: '' l3: '' + - class: mapreduce + l1: 'Yes' + l2: Elements are processed in the largest bundles possible, so count-based triggers are effectively meaningless. + l3: '' - name: '[Meta]data driven triggers' values: @@ -614,6 +700,10 @@ categories: l1: 'No' l2: pending model support l3: + - class: mapreduce + l1: 'No' + l2: pending model support + l3: - name: Composite triggers values: @@ -641,6 +731,10 @@ categories: l1: 'No' l2: '' l3: '' + - class: mapreduce + l1: 'Yes' + l2: fully supported + l3: '' - name: Allowed lateness values: @@ -668,6 +762,10 @@ categories: l1: 'Yes' l2: fully supported l3: '' + - class: mapreduce + l1: 'Yes' + l2: No data is ever late. + l3: '' - name: Timers values: @@ -695,6 +793,10 @@ categories: l1: 'No' l2: not implemented l3: '' + - class: mapreduce + l1: 'Partially' + l2: not implemented + l3: '' - description: How do refinements relate? 
anchor: how @@ -730,6 +832,10 @@ categories: l1: 'Yes' l2: fully supported l3: '' + - class: mapreduce + l1: 'Yes' + l2: fully supported + l3: '' - name: Accumulating values: - class: model @@ -757,6 +863,10 @@ categories: l1: 'No' l2: '' l3: '' + - class: mapreduce + l1: 'Yes' + l2: fully supported + l3: '' - name: 'Accumulating & Retracting' values: @@ -785,3 +895,7 @@ categories: l1: 'No' l2: pending model support l3: '' + - class: mapreduce + l1: 'No' + l2: pending model support + l3: ''
diff --git a/src/documentation/runners/mapreduce.md b/src/documentation/runners/mapreduce.md
new file mode 100644
index 0000000..c88870e
--- /dev/null
+++ b/src/documentation/runners/mapreduce.md
@@ -0,0 +1,80 @@
+---
+layout: default
+title: "Apache Hadoop MapReduce Runner"
+permalink: /documentation/runners/mapreduce/
+redirect_from: /learn/runners/mapreduce/
+---
+# Using the Apache Hadoop MapReduce Runner
+
+The Apache Hadoop MapReduce Runner can be used to execute Beam pipelines using [Apache Hadoop](http://hadoop.apache.org/).
+
+The [Beam Capability Matrix]({{ site.baseurl }}/documentation/runners/capability-matrix/) documents the currently supported capabilities of the Apache Hadoop MapReduce Runner.
+
+## Apache Hadoop MapReduce Runner prerequisites and setup
+You need an Apache Hadoop environment with either a [Single Node Setup](https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html) or a [Cluster Setup](https://hadoop.apache.org/docs/r1.2.1/cluster_setup.html).
+
+The Apache Hadoop MapReduce runner currently supports Apache Hadoop version 2.8.1.
+
+You can add a dependency on the latest version of the Apache Hadoop MapReduce runner by adding the following to your pom.xml:
+```xml
+<dependency>
+  <groupId>org.apache.beam</groupId>
+  <artifactId>beam-runners-mapreduce</artifactId>
+  <version>{{ site.release_latest }}</version>
+</dependency>
+```
+
+## Deploying Apache Hadoop MapReduce with your application
+To execute in a local Hadoop environment, use this command:
+```
+$ mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
+    -Pmapreduce-runner \
+    -Dexec.args="--runner=MapReduceRunner \
+      --inputFile=/path/to/pom.xml \
+      --output=/path/to/counts \
+      --fileOutputDir=<directory for intermediate outputs>"
+```
+
+To execute on a Hadoop cluster, you need to package your program along with all its dependencies in a so-called fat jar.
+
+If you are following the [Beam Quickstart]({{ site.baseurl }}/get-started/quickstart/), this is the command you can run:
+```
+$ mvn package -Pmapreduce-runner
+```
+
+To actually run the pipeline, use this command:
+```
+$ yarn jar word-count-beam-bundled-0.1.jar \
+    org.apache.beam.examples.WordCount \
+    --runner=MapReduceRunner \
+    --inputFile=/path/to/pom.xml \
+    --output=/path/to/counts \
+    --fileOutputDir=<directory for intermediate outputs>
+```
+
+## Pipeline options for the Apache Hadoop MapReduce Runner
+
+When executing your pipeline with the Apache Hadoop MapReduce Runner, you should consider the following pipeline options.
+
+<table class="table table-bordered">
+<tr>
+  <th>Field</th>
+  <th>Description</th>
+  <th>Default Value</th>
+</tr>
+<tr>
+  <td><code>runner</code></td>
+  <td>The pipeline runner to use.
This option allows you to determine the pipeline runner at runtime.</td>
+  <td>Set to <code>MapReduceRunner</code> to run using Apache Hadoop MapReduce.</td>
+</tr>
+<tr>
+  <td><code>jarClass</code></td>
+  <td>The jar class of the user Beam program.</td>
+  <td><code>JarClassInstanceFactory.class</code></td>
+</tr>
+<tr>
+  <td><code>fileOutputDir</code></td>
+  <td>The directory for output files.</td>
+  <td><code>"/tmp/mapreduce/"</code></td>
+</tr>
+</table>
diff --git a/src/get-started/beam-overview.md b/src/get-started/beam-overview.md
index 1d3bbc6..e320c3f 100644
--- a/src/get-started/beam-overview.md
+++ b/src/get-started/beam-overview.md
@@ -36,6 +36,8 @@ Beam currently supports Runners that work with the following distributed process
   alt="Apache Flink">
* Apache Gearpump (incubating) <img src="{{ site.baseurl }}/images/logos/runners/gearpump.png"
  alt="Apache Gearpump">
+* Apache Hadoop MapReduce <img src="{{ site.baseurl }}/images/logos/runners/mapreduce.png"
+  alt="Apache Hadoop MapReduce">
* Apache Spark <img src="{{ site.baseurl }}/images/logos/runners/spark.png"
  alt="Apache Spark">
* Google Cloud Dataflow <img src="{{ site.baseurl }}/images/logos/runners/dataflow.png"
diff --git a/src/images/logos/runners/mapreduce.png b/src/images/logos/runners/mapreduce.png
new file mode 100644
index 0000000..78af2c6
Binary files /dev/null and b/src/images/logos/runners/mapreduce.png differ
--
To stop receiving notification emails like this one, please contact "commits@beam.apache.org" <commits@beam.apache.org>.
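[Editor's note appended to the patch, not part of the commit] The runner documentation above passes pipeline options as `--key=value` command-line arguments (`--runner=MapReduceRunner`, `--fileOutputDir=...`). As a rough illustration of that convention, here is a toy sketch of parsing such arguments into an options map; this is a simplified stand-in, not Beam's actual `PipelineOptionsFactory`:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch: parse "--key=value" pipeline arguments (as used by the
// WordCount commands in the doc above) into a simple options map.
// Not Beam's real option parsing, just an illustration of the format.
public class PipelineArgParser {
    public static Map<String, String> parse(String[] args) {
        Map<String, String> options = new HashMap<>();
        for (String arg : args) {
            if (!arg.startsWith("--")) {
                throw new IllegalArgumentException("Expected --key=value, got: " + arg);
            }
            int eq = arg.indexOf('=');
            // A bare "--flag" (no '=') is treated as a boolean "true".
            String key = eq < 0 ? arg.substring(2) : arg.substring(2, eq);
            String value = eq < 0 ? "true" : arg.substring(eq + 1);
            options.put(key, value);
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> opts = parse(new String[] {
            "--runner=MapReduceRunner",
            "--inputFile=/path/to/pom.xml",
            "--fileOutputDir=/tmp/mapreduce/"
        });
        System.out.println(opts.get("runner")); // prints MapReduceRunner
    }
}
```

In the real runner, the same `--runner=MapReduceRunner` string selects the runner class at runtime, which is why the `runner` row in the options table has no fixed default.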