[GitHub] [flink] tillrohrmann commented on a change in pull request #14238: [FLINK-20347] Rework YARN documentation page

GitBox Fri, 27 Nov 2020 06:05:54 -0800


tillrohrmann commented on a change in pull request #14238:
URL: https://github.com/apache/flink/pull/14238#discussion_r531621048




##########
File path: docs/deployment/resource-providers/yarn.md
##########
@@ -0,0 +1,200 @@
+---
+title:  "Apache Hadoop YARN"
+nav-title: YARN
+nav-parent_id: resource_providers
+nav-pos: 4
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Getting Started
+
+This *Getting Started* section guides you through setting up a fully 
functional Flink Cluster on YARN.
+
+### Introduction
+
+[Apache Hadoop 
YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
 is a resource provider popular with many data processing frameworks.
+Flink services are submitted to YARN's ResourceManager, which spawns 
containers on machines managed by YARN NodeManagers. Flink deploys its 
JobManager and TaskManager instances into such containers.
+
+Flink can dynamically allocate and de-allocate TaskManager resources depending 
on the number of processing slots required by the job(s) running on the 
JobManager.
+
+### Preparation
+
+This *Getting Started* section assumes a functional YARN environment, starting 
from version 2.4.1. YARN environments are provided most conveniently through 
services such as Amazon EMR, Google Cloud DataProc or products like Cloudera. 
[Manually setting up a YARN environment 
locally](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html)
 or [on a 
cluster](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html)
 is not recommended for following through this *Getting Started* tutorial. 
+
+
+- Make sure your YARN cluster is ready for accepting Flink applications by 
running `yarn top`. It should show no error messages.
+- Download a recent Flink distribution from the [download page]({{ 
site.download_url }}) and unpack it.
+- **Important** Make sure that the `HADOOP_CLASSPATH` environment variable is 
set up (it can be checked by running `echo $HADOOP_CLASSPATH`). If not, set it 
up using 
+
+{% highlight bash %}
+export HADOOP_CLASSPATH=`hadoop classpath`
+{% endhighlight %}
+
+
+### Starting a Flink Session on YARN
+
+Once you've made sure that the `HADOOP_CLASSPATH` environment variable is set, 
we can launch a Flink on YARN session, and submit an example job:
+
+{% highlight bash %}
+
+# we assume to be in the root directory of the unzipped Flink distribution
+
+# (0) export HADOOP_CLASSPATH
+export HADOOP_CLASSPATH=`hadoop classpath`
+
+# (1) Start YARN Session
+./bin/yarn-session.sh --detached
+
+# (2) You can now access the Flink Web Interface through the URL printed in 
the last lines of the command output, or through the YARN ResourceManager web 
UI.
+
+# (3) Submit example job
+./bin/flink run ./examples/streaming/TopSpeedWindowing.jar
+
+# (4) Stop YARN session (replace the application id based on the output of the 
yarn-session.sh command)
+echo "stop" | ./bin/yarn-session.sh -yid application_XXXXX_XXX
+{% endhighlight %}
+
+Congratulations! You have successfully run a Flink application by deploying 
Flink on YARN.
+
+
+## Deployment Modes Supported by Flink on YARN
+
+For production use, we recommend deploying Flink Applications in the [Per-job 
or Application Mode]({% link deployment/index.md %}#deployment-modes), as these 
modes provide a better isolation for the Applications.
+
+### Application Mode
+
+Application Mode will launch a Flink cluster on YARN, where the main() method 
of the application jar gets executed on the JobManager in YARN.
+The cluster will shut down as soon as the application has finished. You can 
manually stop the cluster using `yarn application -kill <ApplicationId>` or by 
cancelling the Flink job.
+
+{% highlight bash %}
+./bin/flink run-application -t yarn-application 
./examples/streaming/TopSpeedWindowing.jar
+{% endhighlight %}
+
+To unlock the full potential of the application mode, consider using it with 
the `yarn.provided.lib.dirs` configuration option
+and pre-upload your application jar to a location accessible by all nodes in 
your cluster. In this case, the 
+command could look like: 
+
+{% highlight bash %}
+./bin/flink run-application -t yarn-application \
+       -Dyarn.provided.lib.dirs="hdfs://myhdfs/my-remote-flink-dist-dir" \
+       hdfs://myhdfs/jars/my-application.jar
+{% endhighlight %}
+
+The above will allow the job submission to be extra lightweight as the needed 
Flink jars and the application jar
+are  going to be picked up by the specified remote locations rather than be 
shipped to the cluster by the 
+client.
+
+### Per-Job Cluster Mode
+
+The Per-job Cluster mode will launch a Flink cluster on YARN, then run the 
provided application jar locally and finally submit the JobGraph to the 
JobManager on YARN. If you pass the `--detached` argument, the client will stop 
once the submission is accepted.
+
+The YARN cluster will stop once the job has stopped.
+
+{% highlight bash %}
+./bin/flink run -m yarn-cluster --detached 
./examples/streaming/TopSpeedWindowing.jar
+{% endhighlight %}
+
+With `bin/flink run -m yarn-cluster`, you can run the command line interface 
in YARN cluster mode, allowing to submit jobs to an ad-hoc YARN cluster. The 
command line options of the YARN session described below are then also 
available. They are prefixed with a `y` or `yarn` (for the long argument 
options).
+
+### Session Mode
+
+We describe deployment with the Session Mode in the [Getting 
Started](#getting-started) guide at the top of the page.
+
+The Session Mode has two operation modes:
+- **attached mode** (default): The yarn-session.sh client submits the Flink 
clsuter to YARN, but the client keeps running, tracking the state of the 
cluster. If the cluster fails, the client will show the error. If the client 
gets terminated, it will signal the cluster to shut down as well.
+- **detached mode**: The yarn-session.sh client submits the Flink cluster to 
YARN, then the client returns. Another invocation of the client, or YARN tools 
is needed to stop the Flink cluster.
+
+The session mode will create a hidden YARN properties file in 
`/tmp/.yarn-properties-<username>`, which will be picked up for cluster 
discovery by the command line interface when submitting a job.

Review comment:
       Maybe also mention how to explicitly specify a yarn cluster when 
submitting a job (e.g. if you have multiple Flink session cluster running).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink] tillrohrmann commented on a change in pull request #14238: [FLINK-20347] Rework YARN documentation page

Reply via email to