Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22085#discussion_r209833972
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -180,7 +183,42 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
        dataOut.writeInt(partitionIndex)
        // Python version of driver
        PythonRDD.writeUTF(pythonVer, dataOut)
+       // Init a GatewayServer to port current BarrierTaskContext to Python side.
+       val isBarrier = context.isInstanceOf[BarrierTaskContext]
+       val secret = if (isBarrier) {
+         Utils.createSecret(env.conf)
+       } else {
+         ""
+       }
+       val gatewayServer: Option[GatewayServer] = if (isBarrier) {
+         Some(new GatewayServer.GatewayServerBuilder()
+           .entryPoint(context.asInstanceOf[BarrierTaskContext])
+           .authToken(secret)
+           .javaPort(0)
+           .callbackClient(GatewayServer.DEFAULT_PYTHON_PORT, GatewayServer.defaultAddress(),
+             secret)
+           .build())
--- End diff --
The main reasons are resource usage, the unusual access pattern via Py4J on
the worker side, and the possibility of allowing JVM access from within the Python worker.
It looks like overkill to launch a Java gateway just to allow a single
function call, judging from
https://github.com/apache/spark/pull/22085#discussion_r209490553. This pattern
is quite unusual - in such cases, we usually send the data manually and read
it on the Python side, as is done for `TaskContext`. If I am not mistaken, this now
opens a gateway for each worker.
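To illustrate the pattern I mean, here is a minimal sketch of the manual serialization approach used for `TaskContext`: the JVM writes primitives onto the stream (an int via `writeInt`, a string via `PythonRDD.writeUTF`, i.e. a 4-byte big-endian length followed by UTF-8 bytes), and the Python worker reads them back in the same order. The field names below (`partition_index`, the address string) are illustrative, not the actual protocol:

```python
import struct
from io import BytesIO

def write_utf(s, out):
    # Mirrors PythonRDD.writeUTF: 4-byte big-endian length, then UTF-8 bytes.
    b = s.encode("utf-8")
    out.write(struct.pack(">i", len(b)))
    out.write(b)

def read_utf(inp):
    # Worker-side counterpart: read the length prefix, then the payload.
    (length,) = struct.unpack(">i", inp.read(4))
    return inp.read(length).decode("utf-8")

# JVM side (simulated here with a buffer): serialize the fields the
# Python worker will need, in a fixed order.
buf = BytesIO()
buf.write(struct.pack(">i", 3))   # e.g. dataOut.writeInt(partitionIndex)
write_utf("host1,host2", buf)     # e.g. a barrier-context field as a string

# Python worker side: read the fields back in the same order.
buf.seek(0)
partition_index = struct.unpack(">i", buf.read(4))[0]
addresses = read_utf(buf)
```

With this approach no per-worker gateway or auth secret is needed; the cost is that only data crosses the boundary, not live method calls back into the JVM.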
I was wondering if we can avoid this. Can you elaborate on why this is
required and necessary? I haven't had enough time to look into this yet, so I was
thinking about taking a look this weekend.
This also opens up the possibility of JVM access from the worker side via
`BarrierTaskContext`. For instance, I believe one could hack in and access the
JVM from inside a UDF.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]