Github user vogxn commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13599#discussion_r86756833

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---

```
@@ -69,6 +84,66 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
   }

   /**
+   * Create virtualenv using native virtualenv or conda
+   *
+   * Native Virtualenv:
+   *   - Execute command: virtualenv -p pythonExec --no-site-packages virtualenvName
+   *   - Execute command: python -m pip --cache-dir cache-dir install -r requirement_file
+   *
+   * Conda
+   *   - Execute command: conda create --prefix prefix --file requirement_file -y
+   *
+   */
+  def setupVirtualEnv(): Unit = {
+    logDebug("Start to setup virtualenv...")
+    logDebug("user.dir=" + System.getProperty("user.dir"))
+    logDebug("user.home=" + System.getProperty("user.home"))
+
+    require(virtualEnvType == "native" || virtualEnvType == "conda",
+      s"VirtualEnvType: ${virtualEnvType} is not supported")
+    virtualEnvName = "virtualenv_" + conf.getAppId + "_" + VIRTUALENV_ID.getAndIncrement()
+    // use the absolute path when it is local mode, otherwise just use the filename as it would
+    // be fetched from the FileServer
+    val pyspark_requirements =
+      if (Utils.isLocalMaster(conf)) {
+        conf.get("spark.pyspark.virtualenv.requirements")
+      } else {
+        conf.get("spark.pyspark.virtualenv.requirements").split("/").last
+      }
+
+    val createEnvCommand =
+      if (virtualEnvType == "native") {
+        Arrays.asList(virtualEnvPath,
+          "-p", pythonExec,
+          "--no-site-packages", virtualEnvName)
+      } else {
+        Arrays.asList(virtualEnvPath,
+          "create", "--prefix", System.getProperty("user.dir") + "/" + virtualEnvName,
```
--- End diff --

I started writing this comment, then had to recompile my cluster; it turned out I had made a mistake in the permissions. Apologies for the false alarm. The patch works fine and I'm able to run executors with the conda environment. I'll do some more testing from my end.
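For reference while testing, here is a minimal Python sketch of the command-construction logic in the Scala hunk above (the conda branch is completed from the commands listed in the method's doc comment; the function and parameter names are hypothetical, not part of the patch):

```python
import os


def build_create_env_command(env_type, virtualenv_bin, python_exec,
                             env_name, requirements_path, is_local):
    """Sketch of createEnvCommand from PythonWorkerFactory.setupVirtualEnv."""
    # Local mode can use the absolute requirements path; on a cluster only the
    # file name is used, since the file is fetched from the FileServer.
    requirements = requirements_path if is_local else os.path.basename(requirements_path)
    if env_type == "native":
        return [virtualenv_bin, "-p", python_exec,
                "--no-site-packages", env_name]
    elif env_type == "conda":
        return [virtualenv_bin, "create", "--prefix", env_name,
                "--file", requirements, "-y"]
    raise ValueError("VirtualEnvType: %s is not supported" % env_type)


# Cluster-mode conda example, mirroring the setup described below
cmd = build_create_env_command(
    "conda", "/usr/lib/anaconda2/bin/conda", "python",
    "virtualenv_app_1", "/home/tsp/conda.txt", is_local=False)
```

This makes it easy to see exactly which argv the worker factory hands to `execCommand` in each mode.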
===== Following was my setup =====

Apache Spark (with this patch) is compiled against Apache Hadoop 2.6.0. I've installed `anaconda2-4.1.1` on all the nodes in the cluster under `/usr/lib/anaconda2`. I can create conda environments fine with `conda create --prefix test-env numpy -y`.

The following shell script is used to submit my pyspark programs:

```
$ cat run.sh
/usr/lib/spark/bin/spark-submit --master yarn --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=conda" \
  --conf "spark.pyspark.virtualenv.requirements=/home/tsp/conda.txt" \
  --conf "spark.pyspark.virtualenv.bin.path=/usr/lib/anaconda2/bin/conda" "$@"
```

This is the program I submitted to check whether the conda environment is detected in the executors:

```
$ cat execinfo.py
from pyspark import SparkContext
import sys

if __name__ == '__main__':
    sc = SparkContext()
    print sys.version
    print sc.parallelize(range(1, 2)).map(lambda x: sys.version).collect()
```

This is what is seen in the debug logs:

```
Caused by: java.lang.RuntimeException: Fail to run command: /usr/lib/anaconda2/bin/conda create --prefix /media/ebs2/yarn/local/usercache/tsp/appcache/application_1478497303110_0005/container_1478497303110_0005_01_000003/virtualenv_application_1478497303110_0005_3 --file conda.txt -y
	at org.apache.spark.api.python.PythonWorkerFactory.execCommand(PythonWorkerFactory.scala:142)
	at org.apache.spark.api.python.PythonWorkerFactory.setupVirtualEnv(PythonWorkerFactory.scala:124)
	at org.apache.spark.api.python.PythonWorkerFactory.<init>(PythonWorkerFactory.scala:70)
```

`/media/ebs2/yarn` is owned by `yarn (uid) : hadoop (gid)`.
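Since the "false alarm" here came down to permissions on the YARN local directories, a quick pre-flight check on a node can rule this out before blaming the patch. This is a minimal sketch (the helper name is hypothetical; substitute the real `usercache` path for the scratch directory used in the example):

```python
import os
import tempfile


def can_create_env(parent_dir):
    """Return True if the current user could create an env prefix under parent_dir."""
    if not os.path.isdir(parent_dir):
        return False
    # `conda create --prefix` needs to mkdir inside parent_dir, which requires
    # both write and search (execute) permission for the effective uid/gid.
    return os.access(parent_dir, os.W_OK | os.X_OK)


# A scratch dir stands in for e.g. /media/ebs2/yarn/local/usercache/tsp/appcache
scratch = tempfile.mkdtemp()
print(can_create_env(scratch))         # writable dir we just created
print(can_create_env("/no/such/dir"))  # missing dir
```

Running this as the same user the executors run as (here, `yarn`) shows immediately whether the `conda create` invocation can succeed in that container directory.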