Spark has a launch script, just as Hadoop does. We use the Hadoop launcher script but 
not the Spark one. When you start up a Spark cluster, a spark-env.sh script can set 
a number of environment variables. Our own mahoutSparkContext function, which takes 
the place of the Spark submit script and launcher, doesn't account for most of those 
environment variables.

Unless I've missed something, this means most of the documented options will be 
ignored unless a Mahout user parses them and sets them in their own SparkConf. 
The Mahout CLI drivers don't do this for all possible options; they only support 
a few, such as the job name and spark.executor.memory.
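In other words, today a user has to do something like the following themselves. This 
is only a sketch: the particular settings are just examples, and it assumes 
mahoutSparkContext takes a caller-supplied SparkConf along with the master URL and 
app name (parameter names here are from memory).

  import org.apache.spark.SparkConf
  import org.apache.mahout.sparkbindings._

  // Set Spark options by hand, since nothing in mahoutSparkContext will
  // pull them in from spark-env.sh or spark-defaults.conf.
  val conf = new SparkConf()
    .set("spark.executor.memory", "4g")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

  implicit val sdc = mahoutSparkContext(
    masterUrl = "spark://master:7077",
    appName   = "my-mahout-job",
    sparkConf = conf)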

The question is how best to handle these Spark options. There seem to be two 
choices:
1) use Spark's launch mechanism for drivers but allow some options to be overridden 
in the CLI
2) parse the environment for options and set up the SparkConf defaults in 
mahoutSparkContext from those variables (see the sketch after this list)
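For #2, the sort of thing I have in mind is below. Again only a sketch: the 
variable-to-key mapping is illustrative and would have to track whatever Spark 
documents for spark-env.sh.

  import org.apache.spark.SparkConf

  // Illustrative mapping from spark-env.sh style variables to SparkConf keys;
  // a real version would cover whichever variables we decide to honor.
  val envToConfKey = Map(
    "SPARK_EXECUTOR_MEMORY" -> "spark.executor.memory",
    "SPARK_DRIVER_MEMORY"   -> "spark.driver.memory")

  def confWithEnvDefaults(conf: SparkConf = new SparkConf()): SparkConf =
    envToConfKey.foldLeft(conf) { case (c, (envVar, key)) =>
      sys.env.get(envVar) match {
        // Only fill in a default when the caller hasn't already set the key,
        // so explicit driver/CLI settings still win.
        case Some(v) if !c.contains(key) => c.set(key, v)
        case _ => c
      }
    }

mahoutSparkContext would then run the conf it is handed (or the one it builds) 
through something like confWithEnvDefaults before constructing the context.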

The downside of #2 is that as the variables change we'll have to reflect those 
changes in our code. I forget why #1 is not an option, but Dmitriy has been 
consistently against it; in any case it would mean a fair bit of refactoring, I believe.

Any opinions or corrections?
