Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22085#discussion_r209833972

    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
    @@ -180,7 +183,42 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
         dataOut.writeInt(partitionIndex)
         // Python version of driver
         PythonRDD.writeUTF(pythonVer, dataOut)
    +    // Init a GatewayServer to port current BarrierTaskContext to Python side.
    +    val isBarrier = context.isInstanceOf[BarrierTaskContext]
    +    val secret = if (isBarrier) {
    +      Utils.createSecret(env.conf)
    +    } else {
    +      ""
    +    }
    +    val gatewayServer: Option[GatewayServer] = if (isBarrier) {
    +      Some(new GatewayServer.GatewayServerBuilder()
    +        .entryPoint(context.asInstanceOf[BarrierTaskContext])
    +        .authToken(secret)
    +        .javaPort(0)
    +        .callbackClient(GatewayServer.DEFAULT_PYTHON_PORT, GatewayServer.defaultAddress(),
    +          secret)
    +        .build())
    --- End diff --

    The main reasons are resource usage, the unusual access pattern via Py4J on the worker side, and the possibility of allowing JVM access from within the Python worker. Launching a Java gateway just to allow calling a function looks like overkill, judging from https://github.com/apache/spark/pull/22085#discussion_r209490553. This pattern is pretty unusual - in such cases, we usually send the data manually and read it on the Python side, as we do for `TaskContext`. As it stands, this opens a gateway for each worker if I am not mistaken, and I was wondering if we can avoid that. Can you elaborate on why this is required and necessary? I haven't had enough time to look into this yet, so I was thinking of taking a look this weekend. This also opens up the possibility of JVM access from the worker side via `BarrierTaskContext` - for instance, I believe we could hack around and access the JVM from inside UDFs.
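As a rough illustration of the manual-serialization pattern mentioned above (the way `TaskContext` fields are already shipped to the worker), the JVM side could write length-prefixed UTF-8 strings into the worker's input stream and the Python worker could read them back, with no Py4J gateway involved. This is a hedged sketch only: the framing mimics the spirit of `PythonRDD.writeUTF`, and the field names (`stageId`, `barrierAddresses`) and `read_utf`/`write_utf` helpers are illustrative assumptions, not the actual Spark protocol.

```python
import struct
from io import BytesIO


def write_utf(stream, s):
    # Length-prefixed UTF-8 framing (4-byte big-endian length, then bytes),
    # similar in spirit to PythonRDD.writeUTF on the JVM side.
    data = s.encode("utf-8")
    stream.write(struct.pack(">i", len(data)))
    stream.write(data)


def read_utf(stream):
    # Inverse of write_utf: read the length header, then the payload.
    (length,) = struct.unpack(">i", stream.read(4))
    return stream.read(length).decode("utf-8")


# Simulate the JVM writing hypothetical barrier-task fields into the stream.
buf = BytesIO()
write_utf(buf, "stageId=3")
write_utf(buf, "barrierAddresses=host1:1234,host2:1234")

# Python worker side: read the fields back instead of calling into a gateway.
buf.seek(0)
print(read_utf(buf))  # stageId=3
print(read_utf(buf))  # barrierAddresses=host1:1234,host2:1234
```

The trade-off this sketch highlights: one-shot serialization ships a snapshot of the context, whereas the gateway in the diff allows live method calls (e.g. a barrier wait) from Python back into the JVM, which is presumably why the PR author reached for Py4J.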