Hi,

I was looking at the spark-submit and spark-shell --help output on both Spark 1.3.1 
and the Spark 1.5 snapshot, as well as the Spark documentation for submitting 
Spark applications to YARN. There appears to be a mismatch between the 
preferred syntax and the documentation. 

The Spark documentation 
<http://spark.apache.org/docs/latest/submitting-applications.html#master-urls> 
says that we need to specify either yarn-cluster or yarn-client to connect to a 
YARN cluster. 


yarn-client     Connect to a YARN cluster 
<http://spark.apache.org/docs/latest/running-on-yarn.html> in client 
mode. The cluster location will be found based on the HADOOP_CONF_DIR or 
YARN_CONF_DIR variable.
yarn-cluster    Connect to a YARN cluster 
<http://spark.apache.org/docs/latest/running-on-yarn.html> in cluster 
mode. The cluster location will be found based on the HADOOP_CONF_DIR or 
YARN_CONF_DIR variable.
The spark-submit --help output, on the other hand, shows the following 
options: --master yarn --deploy-mode cluster (or client).

Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).

I want to bring this to your attention because it is confusing for someone 
running Spark on YARN. For example, they look at the spark-submit help output 
and start using that syntax, but when they look at the online documentation or 
the user mailing list, they see different spark-submit syntax. 

From a quick discussion with other engineers at Cloudera, it seems that 
--deploy-mode is preferred, as it is more consistent with the way things are 
done with other cluster managers; for example, there are no standalone-cluster 
or standalone-client masters. The same applies to Mesos.

Either syntax works, but I would like to propose using '--master yarn 
--deploy-mode <mode>' instead of '--master yarn-cluster' or '--master 
yarn-client', as it is consistent with other cluster managers. This would 
require updating all Spark pages related to submitting Spark applications 
to YARN.
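To make the proposal concrete, here is a sketch of the two invocations side by side. The application jar and class names below are hypothetical placeholders, not from any real project:

```shell
# Current documented syntax: the deploy mode is encoded in the master URL.
spark-submit --master yarn-cluster \
  --class com.example.MyApp my-app.jar

# Proposed syntax: master and deploy mode are specified separately,
# matching how standalone and Mesos masters are handled.
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp my-app.jar
```

Both commands launch the same application in the same mode; only the way the mode is expressed differs.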

So far I’ve identified the following pages.

1) http://spark.apache.org/docs/latest/running-on-yarn.html
2) http://spark.apache.org/docs/latest/submitting-applications.html#master-urls

There is a JIRA to track the progress on this as well.

https://issues.apache.org/jira/browse/SPARK-9570
Whichever option we choose dictates whether we update the documentation or 
the spark-submit and spark-shell help output. 

Any thoughts on which direction we should go? 

Guru Medasani
gdm...@gmail.com


