Github user vogxn commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13599#discussion_r86756833
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
    @@ -69,6 +84,66 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
       }
     
       /**
    +   * Create virtualenv using native virtualenv or conda
    +   *
    +   * Native Virtualenv:
    +   *   -  Execute command: virtualenv -p pythonExec --no-site-packages virtualenvName
    +   *   -  Execute command: python -m pip --cache-dir cache-dir install -r requirement_file
    +   *
    +   * Conda
    +   *   -  Execute command: conda create --prefix prefix --file requirement_file -y
    +   *
    +   */
    +  def setupVirtualEnv(): Unit = {
    +    logDebug("Start to setup virtualenv...")
    +    logDebug("user.dir=" + System.getProperty("user.dir"))
    +    logDebug("user.home=" + System.getProperty("user.home"))
    +
    +    require(virtualEnvType == "native" || virtualEnvType == "conda",
    +      s"VirtualEnvType: ${virtualEnvType} is not supported" )
    +    virtualEnvName = "virtualenv_" + conf.getAppId + "_" + VIRTUALENV_ID.getAndIncrement()
    +    // use the absolute path when it is local mode otherwise just use filename as it would be
    +    // fetched from FileServer
    +    val pyspark_requirements =
    +      if (Utils.isLocalMaster(conf)) {
    +        conf.get("spark.pyspark.virtualenv.requirements")
    +      } else {
    +        conf.get("spark.pyspark.virtualenv.requirements").split("/").last
    +      }
    +
    +    val createEnvCommand =
    +      if (virtualEnvType == "native") {
    +        Arrays.asList(virtualEnvPath,
    +          "-p", pythonExec,
    +          "--no-site-packages", virtualEnvName)
    +      } else {
    +        Arrays.asList(virtualEnvPath,
    +          "create", "--prefix", System.getProperty("user.dir") + "/" + virtualEnvName,
    --- End diff --
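
    For readers following along, the requirements-path and command-construction logic quoted above can be sketched in plain Python (a minimal sketch; the names and the `requirements` parameter are illustrative, and the conda branch is completed from the doc comment's `conda create --prefix prefix --file requirement_file -y`, since the diff hunk is cut off mid-call):

    ```python
    def requirements_path(requirements, is_local_master):
        """Mirror of the pyspark_requirements logic: keep the absolute path
        in local mode; on a cluster keep only the filename, since the file
        server fetches it into the working directory."""
        return requirements if is_local_master else requirements.split("/")[-1]

    def create_env_command(env_type, env_path, python_exec, env_name, work_dir, requirements):
        """Sketch of createEnvCommand for the two supported env types."""
        if env_type == "native":
            # the pip install runs as a separate second command
            return [env_path, "-p", python_exec, "--no-site-packages", env_name]
        # conda: env is created under the container's working directory
        return [env_path, "create", "--prefix", work_dir + "/" + env_name,
                "--file", requirements, "-y"]
    ```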
    
    I started writing this comment, then recompiled my cluster and realized I had made a mistake with the permissions. Apologies for the false alarm. The patch works fine and I'm able to run executors with the conda environment. I'll do some more testing on my end.
    
    ===== The following was my setup =====
    Apache Spark (with this patch) is compiled against Apache Hadoop 2.6.0. I've installed `anaconda2-4.1.1` under `/usr/lib/anaconda2` on all the nodes in the cluster. I can create conda environments fine using the command `conda create --prefix test-env numpy -y`.
    
    The following shell script is used to submit my pyspark programs:
    
    ```
    $ cat run.sh
    /usr/lib/spark/bin/spark-submit  --master yarn --deploy-mode client \
        --conf "spark.pyspark.virtualenv.enabled=true" \
        --conf "spark.pyspark.virtualenv.type=conda" \
        --conf "spark.pyspark.virtualenv.requirements=/home/tsp/conda.txt" \
        --conf "spark.pyspark.virtualenv.bin.path=/usr/lib/anaconda2/bin/conda" "$@"
    ```
    
    This is the program I've submitted to check whether the anaconda environment is detected in the executors:
    
    ```
    $ cat execinfo.py
    from pyspark import SparkContext
    import sys
    
    if __name__ == '__main__':
      sc = SparkContext()
      print sys.version
      print sc.parallelize(range(1,2)).map(lambda x: sys.version).collect()
    ```
    
    This is what is seen in the debug logs:
    ```
    Caused by: java.lang.RuntimeException: Fail to run command: /usr/lib/anaconda2/bin/conda create --prefix /media/ebs2/yarn/local/usercache/tsp/appcache/application_1478497303110_0005/container_1478497303110_0005_01_000003/virtualenv_application_1478497303110_0005_3 --file conda.txt -y
            at org.apache.spark.api.python.PythonWorkerFactory.execCommand(PythonWorkerFactory.scala:142)
            at org.apache.spark.api.python.PythonWorkerFactory.setupVirtualEnv(PythonWorkerFactory.scala:124)
            at org.apache.spark.api.python.PythonWorkerFactory.<init>(PythonWorkerFactory.scala:70)
    ```
    
    `/media/ebs2/yarn` is owned by user `yarn` (uid) and group `hadoop` (gid)
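
    To separate a permissions problem from a conda problem, the failing command can be reproduced outside Spark with a small Python analogue of `execCommand` (a hypothetical sketch, not Spark's actual code):

    ```python
    import subprocess

    def exec_command(cmd, cwd=None):
        """Run cmd and raise RuntimeError carrying the full command line on
        a nonzero exit -- roughly what PythonWorkerFactory.execCommand does,
        but also surfacing stderr for debugging."""
        proc = subprocess.Popen(cmd, cwd=cwd,
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError("Fail to run command: %s\n%s"
                               % (" ".join(cmd), err.decode()))
        return out.decode()
    ```

    Running the same `conda create --prefix ... --file ... -y` command through this helper as the `yarn` user, with `cwd` set to a directory `yarn` can write to, should show whether the failure is a filesystem permission error or something conda-specific.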
    


