[ https://issues.apache.org/jira/browse/FLINK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354597#comment-14354597 ]
ASF GitHub Bot commented on FLINK-1219:
---------------------------------------

Github user rmetzger commented on a diff in the pull request:

https://github.com/apache/flink/pull/189#discussion_r26108160

--- Diff: docs/flink_on_tez_guide.md ---
@@ -0,0 +1,293 @@
+---
+title: "Running Flink on YARN leveraging Tez"
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+
+<a href="#top"></a>
+
+## Introduction
+
+You can run Flink using Tez as an execution environment. Flink on Tez
+is currently included in *flink-staging* in alpha. All classes are
+located in the *org.apache.flink.tez* package.
+
+## Why Flink on Tez
+
+[Apache Tez](http://tez.apache.org) is a scalable data processing
+platform. Tez provides an API for specifying a directed acyclic
+graph (DAG), and functionality for placing the DAG vertices in YARN
+containers, as well as data shuffling. In Flink's architecture,
+Tez is at about the same level as Flink's network stack. While Flink's
+network stack focuses heavily on low latency in order to support
+pipelining, data streaming, and iterative algorithms, Tez
+focuses on scalability and elastic resource usage.
+
+Thus, by replacing Flink's network stack with Tez, users can get scalability
+and elastic resource usage in shared clusters while retaining Flink's
+APIs, optimizer, and runtime algorithms (local sorts, hash tables, etc.).
+
+Flink programs can run almost unmodified using Tez as an execution
+environment. Tez supports local execution (e.g., for debugging), and
+remote execution on YARN.
+
+
+## Local execution
+
+The `LocalTezEnvironment` can be used to run programs using the local
+mode provided by Tez. For example, the following is WordCount using Tez local mode.
+It is identical to a normal Flink WordCount, except that the `LocalTezEnvironment` is used.
+To run in local Tez mode, you can simply run a Flink on Tez program
+from your IDE (e.g., right click and run).
+
+{% highlight java %}
+public class WordCountExample {
+    public static void main(String[] args) throws Exception {
+        final LocalTezEnvironment env = LocalTezEnvironment.create();
+
+        DataSet<String> text = env.fromElements(
+            "Who's there?",
+            "I think I hear them. Stand, ho! Who's there?");
+
+        DataSet<Tuple2<String, Integer>> wordCounts = text
+            .flatMap(new LineSplitter())
+            .groupBy(0)
+            .sum(1);
+
+        wordCounts.print();
+
+        env.execute("Word Count Example");
+    }
+
+    public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
+        @Override
+        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
+            for (String word : line.split(" ")) {
+                out.collect(new Tuple2<String, Integer>(word, 1));
+            }
+        }
+    }
+}
+{% endhighlight %}
+
+## YARN execution
+
+### Setup
+
+- Install Tez on your Hadoop 2 cluster following the instructions from the
+  [Apache Tez website](http://tez.apache.org/install.html). If you are able to run
+  the examples that ship with Tez, then Tez has been successfully installed.
+
+- Currently, you need to build Flink yourself to obtain Flink on Tez
+  (the reason is Hadoop version compatibility: Tez releases artifacts
+  on Maven Central with a Hadoop 2.6.0 dependency). Build Flink
+  using `mvn -DskipTests clean package -Dhadoop.version=X.X.X -Pinclude-tez`.
+  Make sure that the Hadoop version matches the version that Tez uses.
+  Obtain the jar file contained in the Flink distribution under
+  `flink-staging/flink-tez/target/flink-tez-x.y.z-flink-fat-jar.jar`
+  and upload it to some directory in HDFS. E.g., to upload the file
+  to the directory `/apps`, execute:
+  {% highlight bash %}
+  $ hadoop fs -put /path/to/flink-tez-x.y.z-flink-fat-jar.jar /apps
+  {% endhighlight %}
+
+- Edit the tez-site.xml configuration file, adding an entry that points to the
+  location of the file. E.g., assuming that the file is in the directory `/apps/`,
+  add the following entry to tez-site.xml:
+  ~~~
+  <property>
+    <name>tez.aux.uris</name>
+    <value>${fs.default.name}/apps/flink-tez-x.y.z-flink-fat-jar.jar</value>
+  </property>
+  ~~~
+
+- At this point, you should be able to run the pre-packaged examples, e.g., run WordCount as:
+  {% highlight bash %}
+  $ hadoop jar /path/to/flink-tez-x.y.z-flink-fat-jar.jar org.apache.flink.tez.ExampleDriver \
--- End diff --

The classname `org.apache.flink.tez.ExampleDriver` is wrong.

> Add support for Apache Tez as execution engine
> ----------------------------------------------
>
> Key: FLINK-1219
> URL: https://issues.apache.org/jira/browse/FLINK-1219
> Project: Flink
> Issue Type: New Feature
> Reporter: Kostas Tzoumas
> Assignee: Kostas Tzoumas
>
> This is an umbrella issue to track Apache Tez support.
> The goal is to be able to run unmodified Flink programs as Apache Tez jobs.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)