
I was looking at the --help output of spark-submit and spark-shell on both 
Spark 1.3.1 and a Spark 1.5 snapshot, and at the Spark documentation for 
submitting Spark applications to YARN. There seems to be a mismatch between 
the syntax shown in the help output and the syntax used in the documentation. 

The Spark documentation says that we need to specify either yarn-cluster or 
yarn-client as the master to connect to a YARN cluster:

yarn-client     Connect to a YARN cluster in client mode. The cluster
                location will be found based on the HADOOP_CONF_DIR or
                YARN_CONF_DIR variable.
yarn-cluster    Connect to a YARN cluster in cluster mode. The cluster
                location will be found based on the HADOOP_CONF_DIR or
                YARN_CONF_DIR variable.

The spark-submit --help output, on the other hand, lists the options as 
--master yarn combined with --deploy-mode cluster or client:

Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]

  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or 
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster
                              (Default: client).

I want to bring this to your attention because it is confusing for anyone 
running Spark on YARN. For example, they read the spark-submit help output 
and start using that syntax, but then they see a different spark-submit 
syntax in the online documentation or on the user mailing list. 

From a quick discussion with other engineers at Cloudera, it seems that 
--deploy-mode is preferred because it is more consistent with how the other 
cluster managers are handled, i.e. there are no standalone-cluster or 
standalone-client masters. The same applies to Mesos.

Either syntax works, but I would like to propose using '--master yarn 
--deploy-mode x' instead of '--master yarn-cluster' or '--master yarn-client', 
as it is consistent with the other cluster managers. This would require 
updating all Spark documentation pages related to submitting Spark 
applications to YARN.
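
To make the equivalence concrete, here is a minimal shell sketch of how the 
two spellings map onto each other. The helper name normalize_master, the class 
com.example.MyApp and the jar my-app.jar are placeholders of mine, not 
anything from Spark itself; the script only prints the command lines and does 
not invoke spark-submit.

```shell
#!/bin/sh
# Hypothetical helper: translate a legacy YARN master value into the
# proposed "--master yarn --deploy-mode <mode>" form. Any other value
# is passed through unchanged as "--master <value>".
normalize_master() {
  case "$1" in
    yarn-client)  echo "--master yarn --deploy-mode client" ;;
    yarn-cluster) echo "--master yarn --deploy-mode cluster" ;;
    *)            echo "--master $1" ;;
  esac
}

# Print (do not run) the legacy and proposed command lines side by side.
# com.example.MyApp and my-app.jar are placeholder names.
for legacy in yarn-client yarn-cluster; do
  echo "legacy:   spark-submit --master $legacy --class com.example.MyApp my-app.jar"
  echo "proposed: spark-submit $(normalize_master "$legacy") --class com.example.MyApp my-app.jar"
done
```

Both lines in each pair are meant to describe the same submission; only the 
way the deploy mode is spelled differs.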

So far I’ve identified the following pages.

1) http://spark.apache.org/docs/latest/running-on-yarn.html 
2) http://spark.apache.org/docs/latest/submitting-applications.html#master-urls 

There is a JIRA to track the progress on this as well.

The option we choose dictates whether we update the documentation or the 
spark-submit and spark-shell help output. 

Any thoughts on which direction we should go? 

Guru Medasani
