Github user zjffdu commented on a diff in the pull request:
https://github.com/apache/spark/pull/13599#discussion_r164037239
--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
@@ -98,7 +98,7 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
private val reviveThread =
ThreadUtils.newDaemonSingleThreadScheduledExecutor("driver-revive-thread")
- class DriverEndpoint(override val rpcEnv: RpcEnv, sparkProperties: Seq[(String, String)])
+ class DriverEndpoint(override val rpcEnv: RpcEnv)
--- End diff ---
Without this change, the following scenario won't work.
1. Launch a Spark app
2. Call `sc.install_packages("numpy")`
3. Run `sc.range(3).map(lambda x: np.__version__).collect()`
4. Restart an executor (kill it; the scheduler will schedule a replacement executor)
5. Run `sc.range(3).map(lambda x: np.__version__).collect()` again. This time it fails, because the newly scheduled executor cannot set up its virtualenv correctly: it never receives the updated `spark.pyspark.virtualenv.packages`. A sketch of these steps follows below.
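For reference, a rough PySpark-shell sketch of the scenario above. It assumes `sc.install_packages` is the virtualenv API proposed in this PR, that virtualenv mode is enabled for the app, and that the package also becomes importable on the driver; step 4 (killing the executor) happens out of band, e.g. by killing the executor JVM on its node.

```python
# Sketch of the reproduction steps (pyspark shell, `sc` already created).
# sc.install_packages is the API proposed in this PR; the driver-side
# `import numpy` assumes install_packages also updates the driver env.

# 1-2. Launch the app, then install numpy into the executors' virtualenv.
sc.install_packages("numpy")

import numpy as np

# 3. Works: the currently running executors have numpy installed.
print(sc.range(3).map(lambda x: np.__version__).collect())

# 4. Kill an executor out of band; the scheduler brings up a replacement.

# 5. Without this change, the replacement executor never sees the updated
#    spark.pyspark.virtualenv.packages, so its virtualenv lacks numpy and
#    the same job fails on the new executor.
print(sc.range(3).map(lambda x: np.__version__).collect())
```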
That's why I make this change in core. Now an executor always gets the updated SparkConf instead of the SparkConf captured when the Spark app started. There is some overhead, but I believe it is trivial and could be improved later.
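To illustrate the design point only (this is not Spark code; the class names are made up), here is a minimal, self-contained sketch of the difference between capturing properties at construction time and reading the live configuration whenever a new executor registers:

```python
# Hypothetical sketch: "snapshot at construction" vs. "read live conf on demand".
# Conf, DriverEndpointSnapshot and DriverEndpointLive are illustrative names,
# not Spark classes; they only show why the snapshot goes stale.

class Conf:
    def __init__(self):
        self._props = {}

    def set(self, key, value):
        self._props[key] = value

    def get_all(self):
        return dict(self._props)


class DriverEndpointSnapshot:
    """Captures the properties once, like the old constructor argument."""
    def __init__(self, conf):
        self._props = conf.get_all()      # frozen at app start

    def props_for_new_executor(self):
        return self._props                # never sees later updates


class DriverEndpointLive:
    """Keeps a reference to the conf and reads it on each registration."""
    def __init__(self, conf):
        self._conf = conf

    def props_for_new_executor(self):
        return self._conf.get_all()       # always the current values


conf = Conf()
snapshot_ep = DriverEndpointSnapshot(conf)
live_ep = DriverEndpointLive(conf)

# Later, e.g. after sc.install_packages("numpy") updates the conf:
conf.set("spark.pyspark.virtualenv.packages", "numpy")

print(snapshot_ep.props_for_new_executor())  # {} -> replacement executor misses numpy
print(live_ep.props_for_new_executor())      # {'spark.pyspark.virtualenv.packages': 'numpy'}
```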
---