Spark has a launch script, just as Hadoop does. We use the Hadoop launcher script but not the Spark one. When a Spark cluster starts up, its spark-env.sh script can set a number of environment variables. Our own mahoutSparkContext function, which takes the place of the spark-submit script and launcher, does not account for most of those environment variables.
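For context, a user currently has to replicate those settings by hand, roughly along these lines (a sketch only; the exact mahoutSparkContext signature and parameter names may differ from what's shown):

    import org.apache.mahout.sparkbindings._
    import org.apache.spark.SparkConf

    // Anything spark-env.sh would normally provide has to be repeated here,
    // because mahoutSparkContext does not read the environment itself.
    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")
      .set("spark.executor.cores", "2")

    // Hypothetical call, assuming the context factory accepts a caller-supplied SparkConf.
    implicit val sdc = mahoutSparkContext(
      masterUrl = "spark://master:7077",
      appName   = "my-mahout-job",
      sparkConf = conf)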
Unless I've missed something, this means most of the documented options will be ignored unless a user of Mahout parses and sets them in their own SparkConf. The Mahout CLI drivers don't do this for all possible options; they only support a few, like the job name and spark.executor.memory. The question is how best to handle these Spark options. There seem to be two choices: 1) use Spark's launch mechanism for drivers but allow some options to be overridden in the CLI, or 2) parse the environment for these options in mahoutSparkContext and use them to set SparkConf defaults (sketched below). The downside of #2 is that as Spark's variables change, we'll have to track those changes in our code. I forget exactly why #1 is not viable, but Dmitriy has been consistently against it; in any case I believe it would mean a fair bit of refactoring. Any opinions or corrections?
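To make #2 concrete, here is a rough sketch (not actual Mahout code) of what the env-to-SparkConf mapping might look like. The variable names and the mapping table are assumptions, and that table is exactly the part that would have to track Spark's documentation over time:

    import org.apache.spark.SparkConf

    // Hypothetical helper: fold a hand-maintained list of spark-env.sh variables
    // into SparkConf defaults before the Spark context is created.
    def confFromEnv(conf: SparkConf = new SparkConf()): SparkConf = {
      val envToProp = Seq(
        "SPARK_EXECUTOR_MEMORY" -> "spark.executor.memory",
        "SPARK_EXECUTOR_CORES"  -> "spark.executor.cores",
        "SPARK_DRIVER_MEMORY"   -> "spark.driver.memory")

      envToProp.foldLeft(conf) { case (c, (envVar, prop)) =>
        sys.env.get(envVar)
          // setIfMissing keeps anything the caller or a CLI flag already set explicitly.
          .map(value => c.setIfMissing(prop, value))
          .getOrElse(c)
      }
    }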
