METRON-1770 Add Docs for Running the Profiler with Spark on YARN (nickwallen) closes apache/metron#1189
Project: http://git-wip-us.apache.org/repos/asf/metron/repo Commit: http://git-wip-us.apache.org/repos/asf/metron/commit/f83f0ac0 Tree: http://git-wip-us.apache.org/repos/asf/metron/tree/f83f0ac0 Diff: http://git-wip-us.apache.org/repos/asf/metron/diff/f83f0ac0 Branch: refs/heads/master Commit: f83f0ac06622e091a09d9f256f817e7235c63e53 Parents: cad2f40 Author: nickwallen <n...@nickallen.org> Authored: Wed Sep 19 10:01:50 2018 -0400 Committer: nickallen <nickal...@apache.org> Committed: Wed Sep 19 10:01:50 2018 -0400 ---------------------------------------------------------------------- .../metron-profiler-spark/README.md | 94 ++++++++++++++------ .../src/main/config/batch-profiler.properties | 8 +- 2 files changed, 76 insertions(+), 26 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/metron/blob/f83f0ac0/metron-analytics/metron-profiler-spark/README.md ---------------------------------------------------------------------- diff --git a/metron-analytics/metron-profiler-spark/README.md b/metron-analytics/metron-profiler-spark/README.md index d137e51..3d7017c 100644 --- a/metron-analytics/metron-profiler-spark/README.md +++ b/metron-analytics/metron-profiler-spark/README.md @@ -22,8 +22,8 @@ This project allows profiles to be executed using [Apache Spark](https://spark.a * [Introduction](#introduction) * [Getting Started](#getting-started) * [Installation](#installation) -* [Configuring the Profiler](#configuring-the-profiler) * [Running the Profiler](#running-the-profiler) +* [Configuring the Profiler](#configuring-the-profiler) ## Introduction @@ -129,6 +129,73 @@ The Batch Profiler requires Spark version 2.3.0+. find ./ -name "metron-profiler-spark*.deb" ``` +## Running the Profiler + +A script located at `$METRON_HOME/bin/start_batch_profiler.sh` has been provided to simplify running the Batch Profiler. This script makes the following assumptions. 
+ + * The script builds the profiles defined in `$METRON_HOME/config/zookeeper/profiler.json`. + + * The properties defined in `$METRON_HOME/config/batch-profiler.properties` are passed to both the Profiler and Spark. You can define both Spark and Profiler properties in this same file. + + * The script assumes that Spark is installed at `/usr/hdp/current/spark2-client`. This can be overridden if you define an environment variable called `SPARK_HOME` prior to executing the script. + +### Advanced Usage + +The Batch Profiler may also be started using `spark-submit` as follows. See the Spark documentation for more information about [`spark-submit`](https://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit). + +``` +${SPARK_HOME}/bin/spark-submit \ + --class org.apache.metron.profiler.spark.cli.BatchProfilerCLI \ + --properties-file ${SPARK_PROPS_FILE} \ + ${METRON_HOME}/lib/metron-profiler-spark-*.jar \ + --config ${PROFILER_PROPS_FILE} \ + --profiles ${PROFILES_FILE} +``` + +The Batch Profiler accepts the following arguments when run from the command line as shown above. All arguments following the Profiler jar are passed to the Profiler. All arguments preceding the Profiler jar are passed to Spark. + +| Argument | Description +|--- |--- +| -p, --profiles | The path to a file containing the profile definitions. +| -c, --config | The path to the profiler properties file. +| -g, --globals | The path to a properties file containing global properties. +| -h, --help | Print the help text. + +### Spark Execution + +Spark supports a number of different [cluster managers](https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types). The underlying cluster manager is transparent to the Profiler. To run the Profiler on a particular cluster manager, it is just a matter of setting the appropriate options as defined in the Spark documentation.
+ +#### Local Mode + +By default, the Batch Profiler instructs Spark to run in local mode. This will run all of the Spark execution components within a single JVM. This mode is only useful for testing with a limited set of data. + +`$METRON_HOME/config/batch-profiler.properties` +``` +spark.master=local +``` + +#### Spark on YARN + +To run the Profiler using [Spark on YARN](https://spark.apache.org/docs/latest/running-on-yarn.html#running-spark-on-yarn), at a minimum edit the value of `spark.master` as shown. In many cases it also makes sense to set the YARN [deploy mode](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn) to `cluster`. + +`$METRON_HOME/config/batch-profiler.properties` +``` +spark.master=yarn +spark.submit.deployMode=cluster +``` + +See the Spark documentation for information on how to further control the execution of Spark on YARN. Any of [these properties](http://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties) can be added to the Profiler properties file. + +The following command can be useful to review the logs generated when the Profiler is executed on YARN. +``` +yarn logs -applicationId <application-id> +``` + +#### Kerberos + +See the Spark documentation for information on running the Batch Profiler in a [secure, kerberized cluster](https://spark.apache.org/docs/latest/running-on-yarn.html#running-in-a-secure-cluster). + + ## Configuring the Profiler By default, the configuration for the Batch Profiler is stored in the local filesystem at `$METRON_HOME/config/batch-profiler.properties`. @@ -147,7 +214,7 @@ You can store both settings for the Profiler along with settings for Spark in th ### `profiler.batch.input.path` -*Default*: "hdfs://localhost:9000/apps/metron/indexing/indexed/*/*" +*Default*: hdfs://localhost:9000/apps/metron/indexing/indexed/\*/\* The path to the input data read by the Batch Profiler. @@ -190,26 +257,3 @@ The name of the HBase table that profile data is written to. 
The Profiler expec *Default*: P The column family used to store profile data in HBase. - -## Running the Profiler - -A script located at `$METRON_HOME/bin/start_batch_profiler.sh` has been provided to simplify running the Batch Profiler. The Batch Profiler may also be started as follows using the `spark-submit` script. - -``` -${SPARK_HOME}/bin/spark-submit \ - --class org.apache.metron.profiler.spark.cli.BatchProfilerCLI \ - --properties-file ${SPARK_PROPS_FILE} \ - ${PROFILER_JAR} \ - --config ${PROFILER_PROPS_FILE} \ - --profiles ${PROFILES_FILE} -``` - -The Batch Profiler also accepts the following command line arguments when run from the command line. - -| Argument | Description -|--- |--- -| -p, --profiles | The path to a file containing the profile definitions. -| -c, --config | The path to the profiler properties file. -| -g, --globals | The path to a properties file containing global properties. -| -h, --help | Print the help text. - http://git-wip-us.apache.org/repos/asf/metron/blob/f83f0ac0/metron-analytics/metron-profiler-spark/src/main/config/batch-profiler.properties ---------------------------------------------------------------------- diff --git a/metron-analytics/metron-profiler-spark/src/main/config/batch-profiler.properties b/metron-analytics/metron-profiler-spark/src/main/config/batch-profiler.properties index c651791..400c526 100644 --- a/metron-analytics/metron-profiler-spark/src/main/config/batch-profiler.properties +++ b/metron-analytics/metron-profiler-spark/src/main/config/batch-profiler.properties @@ -16,5 +16,11 @@ # limitations under the License. # # -spark.master=local spark.app.name=Batch Profiler +spark.master=local + +profiler.batch.input.path=hdfs://localhost:9000/apps/metron/indexing/indexed/*/* +profiler.batch.input.format=text + +profiler.period.duration=15 +profiler.period.duration.units=MINUTES
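The argument-ordering rule the new README documents (everything before the Profiler jar is consumed by `spark-submit`; everything after the jar is forwarded to `BatchProfilerCLI`) can be sketched as follows. This is an illustrative helper, not part of Metron or Spark; the paths and the jar version used in the example are placeholders.

```python
def build_spark_submit_command(spark_home, jar, spark_props_file,
                               profiler_props_file, profiles_file):
    """Assemble a spark-submit invocation for the Batch Profiler.

    Arguments placed before the application jar are interpreted by
    spark-submit itself; arguments placed after the jar are passed
    through to the application (here, BatchProfilerCLI).
    """
    return [
        f"{spark_home}/bin/spark-submit",
        # Spark-side arguments: everything before the application jar.
        "--class", "org.apache.metron.profiler.spark.cli.BatchProfilerCLI",
        "--properties-file", spark_props_file,
        # The application jar marks the boundary between the two sets.
        jar,
        # Profiler-side arguments: everything after the application jar.
        "--config", profiler_props_file,
        "--profiles", profiles_file,
    ]

# Placeholder paths and version, for illustration only.
cmd = build_spark_submit_command(
    spark_home="/usr/hdp/current/spark2-client",
    jar="/usr/metron/0.6.0/lib/metron-profiler-spark-0.6.0.jar",
    spark_props_file="/usr/metron/0.6.0/config/batch-profiler.properties",
    profiler_props_file="/usr/metron/0.6.0/config/batch-profiler.properties",
    profiles_file="/usr/metron/0.6.0/config/zookeeper/profiler.json",
)
```

Note that `--config` and `--profiles` after the jar correspond to the `-c` and `-p` options in the argument table, while `--properties-file` before the jar is a `spark-submit` option.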