[
https://issues.apache.org/jira/browse/FLINK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354597#comment-14354597
]
ASF GitHub Bot commented on FLINK-1219:
---------------------------------------
Github user rmetzger commented on a diff in the pull request:
https://github.com/apache/flink/pull/189#discussion_r26108160
--- Diff: docs/flink_on_tez_guide.md ---
@@ -0,0 +1,293 @@
+---
+title: "Running Flink on YARN leveraging Tez"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+
+<a href="#top"></a>
+
+## Introduction
+
+You can run Flink using Tez as an execution environment. Flink on Tez
+is currently included in *flink-staging* in alpha. All classes are
+localted in the *org.apache.flink.tez* package.
+
+## Why Flink on Tez
+
+[Apache Tez](tez.apache.org) is a scalable data processing
+platform. Tez provides an API for specifying a directed acyclic
+graph (DAG), and functionality for placing the DAG vertices in YARN
+containers, as well as data shuffling. In Flink's architecture,
+Tez is at about the same level as Flink's network stack. While Flink's
+network stack focuses heavily on low latency in order to support
+pipelining, data streaming, and iterative algorithms, Tez
+focuses on scalability and elastic resource usage.
+
+Thus, by replacing Flink's network stack with Tez, users can get
scalability
+and elastic resource usage in shared clusters while retaining Flink's
+APIs, optimizer, and runtime algorithms (local sorts, hash tables, etc).
+
+Flink programs can run almost unmodified using Tez as an execution
+environment. Tez supports local execution (e.g., for debugging), and
+remote execution on YARN.
+
+
+## Local execution
+
+The `LocalTezEnvironment` can be used run programs using the local
+mode provided by Tez. This is for example WordCount using Tez local mode.
+It is identical to a normal Flink WordCount, except that the
`LocalTezEnvironment` is used.
+To run in local Tez mode, you can simply run a Flink on Tez program
+from your IDE (e.g., right click and run).
+
+{% highlight java %}
+public class WordCountExample {
+ public static void main(String[] args) throws Exception {
+ final LocalTezEnvironment env = LocalTezEnvironment.create();
+
+ DataSet<String> text = env.fromElements(
+ "Who's there?",
+ "I think I hear them. Stand, ho! Who's there?");
+
+ DataSet<Tuple2<String, Integer>> wordCounts = text
+ .flatMap(new LineSplitter())
+ .groupBy(0)
+ .sum(1);
+
+ wordCounts.print();
+
+ env.execute("Word Count Example");
+ }
+
+ public static class LineSplitter implements FlatMapFunction<String,
Tuple2<String, Integer>> {
+ @Override
+ public void flatMap(String line, Collector<Tuple2<String,
Integer>> out) {
+ for (String word : line.split(" ")) {
+ out.collect(new Tuple2<String, Integer>(word, 1));
+ }
+ }
+ }
+}
+{% endhighlight %}
+
+## YARN execution
+
+### Setup
+
+- Install Tez on your Hadoop 2 cluster following the instructions from the
+ [Apache Tez website](http://tez.apache.org/install.html). If you are
able to run
+ the examples that ship with Tez, then Tez has been successfully
installed.
+
+- Currently, you need to build Flink yourself to obtain Flink on Tez
+ (the reason is a Hadoop version compatibility: Tez releases artifacts
+ on Maven central with a Hadoop 2.6.0 dependency). Build Flink
+ using `mvn -DskipTests clean package -Dhadoop.version=X.X.X
-Pinclude-tez`.
+ Make sure that the Hadoop version matches the version that Tez uses.
+ Obtain the jar file contained in the Flink distribution under
+ `flink-staging/flink-tez/target/flink-tez-x.y.z-flink-fat-jar.jar`
+ and upload it to some directory in HDFS. E.g., to upload the file
+ to the directory `/apps`, execute
+ {% highlight bash %}
+ $ hadoop fs -put /path/to/flink-tez-x.y.z-flink-fat-jar.jar /apps
+ {% endhighlight %}
+
+- Edit the tez-site.xml configuration file, adding an entry that points to
the
+ location of the file. E.g., assuming that the file is in the directory
`/apps/`,
+ add the following entry to tez-site.xml
+ ~~~<property>
+ <name>tez.aux.uris</name>
+
<value>${fs.default.name}/apps/flink-tez-x.y.z-flink-fat-jar.jar</value>
+ </property>
+ ~~~
+
+- At this point, you should be able to run the pre-packaged examples,
e.g., run WordCount as:
+ {% highlight bash %}
+ $ hadoop jar /path/to/flink-tez-x.y.z-flink-fat-jar.jar
org.apache.flink.tez.ExampleDriver \
--- End diff --
The classname `org.apache.flink.tez.ExampleDriver` is wrong.
> Add support for Apache Tez as execution engine
> ----------------------------------------------
>
> Key: FLINK-1219
> URL: https://issues.apache.org/jira/browse/FLINK-1219
> Project: Flink
> Issue Type: New Feature
> Reporter: Kostas Tzoumas
> Assignee: Kostas Tzoumas
>
> This is an umbrella issue to track Apache Tez support.
> The goal is to be able to run unmodified Flink programs as Apache Tez jobs.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)