Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22085#discussion_r209833972
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -180,7 +183,42 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
        dataOut.writeInt(partitionIndex)
        // Python version of driver
        PythonRDD.writeUTF(pythonVer, dataOut)
+       // Init a GatewayServer to port current BarrierTaskContext to Python side.
+       val isBarrier = context.isInstanceOf[BarrierTaskContext]
+       val secret = if (isBarrier) {
+         Utils.createSecret(env.conf)
+       } else {
+         ""
+       }
+       val gatewayServer: Option[GatewayServer] = if (isBarrier) {
+         Some(new GatewayServer.GatewayServerBuilder()
+           .entryPoint(context.asInstanceOf[BarrierTaskContext])
+           .authToken(secret)
+           .javaPort(0)
+           .callbackClient(GatewayServer.DEFAULT_PYTHON_PORT, GatewayServer.defaultAddress(),
+             secret)
+           .build())
--- End diff --
The main reasons are resource usage, the unusual access pattern via Py4J on
the worker side, and the possibility of allowing JVM access from within the Python worker.
It looks like overkill to launch a Java gateway just to allow a single
function call, judging from
https://github.com/apache/spark/pull/22085#discussion_r209490553. This pattern
is quite unusual - in such cases, we usually send the data manually and read
it on the Python side, as is done for `TaskContext`. If I am not mistaken, this now
opens a gateway for each worker.
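To illustrate the pattern I mean, here is a minimal sketch of the manual serialization approach used for `TaskContext`: the JVM writes primitives onto the stream (an int via `writeInt`, a string via `PythonRDD.writeUTF`, i.e. a 4-byte big-endian length followed by UTF-8 bytes), and the Python worker reads them back in the same order. The field names below (`partition_index`, the address string) are illustrative, not the actual protocol:

```python
import struct
from io import BytesIO

def write_utf(s, out):
    # Mirrors PythonRDD.writeUTF: 4-byte big-endian length, then UTF-8 bytes.
    b = s.encode("utf-8")
    out.write(struct.pack(">i", len(b)))
    out.write(b)

def read_utf(inp):
    # Worker-side counterpart: read the length prefix, then the payload.
    (length,) = struct.unpack(">i", inp.read(4))
    return inp.read(length).decode("utf-8")

# JVM side (simulated here with a buffer): serialize the fields the
# Python worker will need, in a fixed order.
buf = BytesIO()
buf.write(struct.pack(">i", 3))   # e.g. dataOut.writeInt(partitionIndex)
write_utf("host1,host2", buf)     # e.g. a barrier-context field as a string

# Python worker side: read the fields back in the same order.
buf.seek(0)
partition_index = struct.unpack(">i", buf.read(4))[0]
addresses = read_utf(buf)
```

With this approach no per-worker gateway or auth secret is needed; the cost is that only data crosses the boundary, not live method calls back into the JVM.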
I was wondering if we can avoid this. Can you elaborate on why this is
required and necessary? I haven't had enough time to look into this yet, so I was
thinking about taking a look this weekend.
This also opens up the possibility of JVM access from the worker side via
`BarrierTaskContext`. For instance, I believe one could hack in and access the
JVM from inside a UDF.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]