METRON-1770 Add Docs for Running the Profiler with Spark on YARN (nickwallen) 
closes apache/metron#1189


Project: http://git-wip-us.apache.org/repos/asf/metron/repo
Commit: http://git-wip-us.apache.org/repos/asf/metron/commit/f83f0ac0
Tree: http://git-wip-us.apache.org/repos/asf/metron/tree/f83f0ac0
Diff: http://git-wip-us.apache.org/repos/asf/metron/diff/f83f0ac0

Branch: refs/heads/master
Commit: f83f0ac06622e091a09d9f256f817e7235c63e53
Parents: cad2f40
Author: nickwallen <n...@nickallen.org>
Authored: Wed Sep 19 10:01:50 2018 -0400
Committer: nickallen <nickal...@apache.org>
Committed: Wed Sep 19 10:01:50 2018 -0400

----------------------------------------------------------------------
 .../metron-profiler-spark/README.md             | 94 ++++++++++++++------
 .../src/main/config/batch-profiler.properties   |  8 +-
 2 files changed, 76 insertions(+), 26 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/metron/blob/f83f0ac0/metron-analytics/metron-profiler-spark/README.md
----------------------------------------------------------------------
diff --git a/metron-analytics/metron-profiler-spark/README.md b/metron-analytics/metron-profiler-spark/README.md
index d137e51..3d7017c 100644
--- a/metron-analytics/metron-profiler-spark/README.md
+++ b/metron-analytics/metron-profiler-spark/README.md
@@ -22,8 +22,8 @@ This project allows profiles to be executed using [Apache Spark](https://spark.a
 * [Introduction](#introduction)
 * [Getting Started](#getting-started)
 * [Installation](#installation)
-* [Configuring the Profiler](#configuring-the-profiler)
 * [Running the Profiler](#running-the-profiler)
+* [Configuring the Profiler](#configuring-the-profiler)
 
 ## Introduction
 
@@ -129,6 +129,73 @@ The Batch Profiler requires Spark version 2.3.0+.
     find ./ -name "metron-profiler-spark*.deb"
     ```
 
+## Running the Profiler
+
+A script located at `$METRON_HOME/bin/start_batch_profiler.sh` has been provided to simplify running the Batch Profiler.  This script makes the following assumptions.
+
+  * The script builds the profiles defined in `$METRON_HOME/config/zookeeper/profiler.json`.
+
+  * The properties defined in `$METRON_HOME/config/batch-profiler.properties` are passed to both the Profiler and Spark.  You can define both Spark and Profiler properties in this same file.
+
+  * The script assumes that Spark is installed at `/usr/hdp/current/spark2-client`.  This can be overridden by defining an environment variable called `SPARK_HOME` before executing the script, as shown in the example below.
+
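+For example, assuming Spark were installed at a non-default location such as `/opt/spark` (a hypothetical path), the script could be launched as follows.
+
+```
+# Point the script at the Spark installation (hypothetical path)
+export SPARK_HOME=/opt/spark
+
+# Build the profiles defined in $METRON_HOME/config/zookeeper/profiler.json
+$METRON_HOME/bin/start_batch_profiler.sh
+```
+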
+### Advanced Usage
+
+The Batch Profiler may also be started using `spark-submit` as follows.  See the Spark Documentation for more information about [`spark-submit`](https://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit).
+
+```
+${SPARK_HOME}/bin/spark-submit \
+    --class org.apache.metron.profiler.spark.cli.BatchProfilerCLI \
+    --properties-file ${SPARK_PROPS_FILE} \
+    ${METRON_HOME}/lib/metron-profiler-spark-*.jar \
+    --config ${PROFILER_PROPS_FILE} \
+    --profiles ${PROFILES_FILE}
+```
+
+The Batch Profiler accepts the following arguments when run from the command line as shown above.  All arguments following the Profiler jar are passed to the Profiler.  All arguments preceding the Profiler jar are passed to Spark.  An example that also passes the `--globals` argument is shown after the table.
+
+| Argument         | Description
+|---               |---
+| -p, --profiles   | The path to a file containing the profile definitions.
+| -c, --config     | The path to the profiler properties file.
+| -g, --globals    | The path to a properties file containing global properties.
+| -h, --help       | Print the help text.
+
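+As a sketch, the same command can also pass global properties by adding the `--globals` argument; the `GLOBALS_FILE` variable below is illustrative and should point at a properties file of global properties.
+
+```
+${SPARK_HOME}/bin/spark-submit \
+    --class org.apache.metron.profiler.spark.cli.BatchProfilerCLI \
+    --properties-file ${SPARK_PROPS_FILE} \
+    ${METRON_HOME}/lib/metron-profiler-spark-*.jar \
+    --config ${PROFILER_PROPS_FILE} \
+    --globals ${GLOBALS_FILE} \
+    --profiles ${PROFILES_FILE}
+```
+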
+### Spark Execution
+
+Spark supports a number of different [cluster managers](https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types).  The underlying cluster manager is transparent to the Profiler.  To run the Profiler on a particular cluster manager, simply set the appropriate options as defined in the Spark documentation.
+
+#### Local Mode
+
+By default, the Batch Profiler instructs Spark to run in local mode.  This will run all of the Spark execution components within a single JVM.  This mode is only useful for testing with a limited set of data.
+
+`$METRON_HOME/config/batch-profiler.properties`
+```
+spark.master=local
+```
+
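+Note that Spark's `local` master also accepts a thread count.  For example, `local[*]` runs with as many worker threads as there are available cores, which can speed up local testing; this is a standard Spark option rather than a Profiler-specific setting.
+
+`$METRON_HOME/config/batch-profiler.properties`
+```
+# Alternative: run locally using all available cores
+spark.master=local[*]
+```
+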
+#### Spark on YARN
+
+To run the Profiler using [Spark on YARN](https://spark.apache.org/docs/latest/running-on-yarn.html#running-spark-on-yarn), at a minimum edit the value of `spark.master` as shown. In many cases it also makes sense to set the YARN [deploy mode](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn) to `cluster`.
+
+`$METRON_HOME/config/batch-profiler.properties`
+```
+spark.master=yarn
+spark.submit.deployMode=cluster
+```
+
+See the Spark documentation for information on how to further control the execution of Spark on YARN.  Any of [these properties](http://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties) can be added to the Profiler properties file.
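+
+For example, common resource settings such as the following could be added to the same file; the values shown are illustrative only and should be sized for your cluster.
+
+`$METRON_HOME/config/batch-profiler.properties`
+```
+# Illustrative resource settings for Spark on YARN; tune for your cluster
+spark.executor.instances=4
+spark.executor.memory=4g
+spark.executor.cores=2
+```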
+
+The following command can be useful to review the logs generated when the Profiler is executed on YARN.
+```
+yarn logs -applicationId <application-id>
+```
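+
+The application id is typically reported when the job is submitted and can also be found by listing applications with the YARN CLI.
+```
+yarn application -list
+```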
+
+#### Kerberos
+
+See the Spark documentation for information on running the Batch Profiler in a [secure, kerberized cluster](https://spark.apache.org/docs/latest/running-on-yarn.html#running-in-a-secure-cluster).
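+
+As a rough sketch, and assuming a headless keytab has already been created for the Profiler (the principal and keytab path below are hypothetical), the relevant Spark properties can be added to the same properties file.
+
+`$METRON_HOME/config/batch-profiler.properties`
+```
+# Hypothetical principal and keytab; replace with values for your environment
+spark.yarn.principal=metron@EXAMPLE.COM
+spark.yarn.keytab=/etc/security/keytabs/metron.headless.keytab
+```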
+
+
 ## Configuring the Profiler
 
 By default, the configuration for the Batch Profiler is stored in the local filesystem at `$METRON_HOME/config/batch-profiler.properties`.
@@ -147,7 +214,7 @@ You can store both settings for the Profiler along with settings for Spark in th
 
 ### `profiler.batch.input.path`
 
-*Default*: "hdfs://localhost:9000/apps/metron/indexing/indexed/*/*"
+*Default*: hdfs://localhost:9000/apps/metron/indexing/indexed/\*/\*
 
 The path to the input data read by the Batch Profiler.
 
@@ -190,26 +257,3 @@ The name of the HBase table that profile data is written to.  The Profiler expec
 *Default*: P
 
 The column family used to store profile data in HBase.
-
-## Running the Profiler
-
-A script located at `$METRON_HOME/bin/start_batch_profiler.sh` has been provided to simplify running the Batch Profiler.  The Batch Profiler may also be started as follows using the `spark-submit` script.
-
-```
-${SPARK_HOME}/bin/spark-submit \
-    --class org.apache.metron.profiler.spark.cli.BatchProfilerCLI \
-    --properties-file ${SPARK_PROPS_FILE} \
-    ${PROFILER_JAR} \
-    --config ${PROFILER_PROPS_FILE} \
-    --profiles ${PROFILES_FILE}
-```
-
-The Batch Profiler also accepts the following command line arguments when run from the command line.
-
-| Argument         | Description
-|---               |---
-| -p, --profiles   | The path to a file containing the profile definitions.
-| -c, --config     | The path to the profiler properties file.
-| -g, --globals    | The path to a properties file containing global properties.
-| -h, --help       | Print the help text.
-

http://git-wip-us.apache.org/repos/asf/metron/blob/f83f0ac0/metron-analytics/metron-profiler-spark/src/main/config/batch-profiler.properties
----------------------------------------------------------------------
diff --git a/metron-analytics/metron-profiler-spark/src/main/config/batch-profiler.properties b/metron-analytics/metron-profiler-spark/src/main/config/batch-profiler.properties
index c651791..400c526 100644
--- a/metron-analytics/metron-profiler-spark/src/main/config/batch-profiler.properties
+++ b/metron-analytics/metron-profiler-spark/src/main/config/batch-profiler.properties
@@ -16,5 +16,11 @@
 # limitations under the License.
 #
 #
-spark.master=local
 spark.app.name=Batch Profiler
+spark.master=local
+
+profiler.batch.input.path=hdfs://localhost:9000/apps/metron/indexing/indexed/*/*
+profiler.batch.input.format=text
+
+profiler.period.duration=15
+profiler.period.duration.units=MINUTES
