Hi, I was looking at the spark-submit and spark-shell --help output on both versions (Spark 1.3.1 and Spark 1.5-snapshot) and at the Spark documentation for submitting Spark applications to YARN. It seems there is some mismatch between the preferred syntax and the documentation.
The Spark documentation (http://spark.apache.org/docs/latest/submitting-applications.html#master-urls) says that we need to specify either yarn-cluster or yarn-client to connect to a YARN cluster:

  yarn-client    Connect to a YARN cluster in client mode. The cluster location
                 will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
  yarn-cluster   Connect to a YARN cluster in cluster mode. The cluster location
                 will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.

The spark-submit --help output, on the other hand, says to use --master yarn together with --deploy-mode cluster or client:

  Usage: spark-submit [options] <app jar | python file> [app arguments]
  Usage: spark-submit --kill [submission ID] --master [spark://...]
  Usage: spark-submit --status [submission ID] --master [spark://...]

  Options:
    --master MASTER_URL        spark://host:port, mesos://host:port, yarn, or local.
    --deploy-mode DEPLOY_MODE  Whether to launch the driver program locally ("client")
                               or on one of the worker machines inside the cluster
                               ("cluster") (Default: client).

I want to bring this to your attention because it is a bit confusing for someone running Spark on YARN. For example, they look at the spark-submit help output and start using that syntax, but when they look at the online documentation or the user mailing list, they see a different spark-submit syntax.

From a quick discussion with other engineers at Cloudera, it seems --deploy-mode is preferred, as it is more consistent with the way things are done with the other cluster managers; i.e., there is no standalone-cluster or standalone-client master, and the same applies to Mesos. Either syntax works, but I would like to propose using '--master yarn --deploy-mode x' instead of '--master yarn-cluster' or '--master yarn-client', as it is consistent with the other cluster managers.
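To make the two syntaxes concrete, here is a sketch of the same submission written both ways (the class name and jar are hypothetical placeholders, not from any real application):

```shell
# Current doc syntax: the deploy mode is baked into the master string
spark-submit --master yarn-cluster \
  --class org.example.MyApp \
  my-app.jar

# Proposed syntax: master and deploy mode are separate flags,
# matching how the standalone and Mesos masters are specified
spark-submit --master yarn --deploy-mode cluster \
  --class org.example.MyApp \
  my-app.jar
```

Both commands submit the same application; the second just keeps the cluster manager (--master) orthogonal to where the driver runs (--deploy-mode).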
This would require updating all Spark pages related to submitting Spark applications to YARN. So far I've identified the following pages:

1) http://spark.apache.org/docs/latest/running-on-yarn.html
2) http://spark.apache.org/docs/latest/submitting-applications.html#master-urls

There is a JIRA to track the progress on this as well: https://issues.apache.org/jira/browse/SPARK-9570

The option we choose dictates whether we update the documentation or the spark-submit and spark-shell help pages. Any thoughts on which direction we should go?

Guru Medasani
gdm...@gmail.com