Re: [PR] [SPARK-46361][PYTHON][CORE] Spark dataset chunk read api (developer API) [spark]

via GitHub Tue, 02 Jan 2024 18:45:54 -0800


WeichenXu123 commented on code in PR #44294:
URL: https://github.com/apache/spark/pull/44294#discussion_r1440018036



##########
core/src/main/scala/org/apache/spark/SparkEnv.scala:
##########
@@ -99,6 +99,10 @@ class SparkEnv (
 
   private[spark] var executorBackend: Option[ExecutorBackend] = None
 
+  private[spark] var cachedArrowBatchServerPort: Option[Int] = None
+
+  private[spark] var cachedArrowBatchServerSecret: Option[String] = None

Review Comment:
   I am considering adding an API like:
   
   ```
   # 1. The user calls this developer API in a PySpark UDF
   # to start an Arrow stream server on the local executor.
   server_port, server_secret = startChunkServer()
   
   # 2. Read chunk data using the server created above.
   # The user can call this function in the PySpark UDF or in
   # descendant processes of the PySpark UDF.
   readChunk(chunk_id, server_port, server_secret)
   
   # 3. Shut down the server created above.
   shutdownChunkServer(server_port, server_secret)
   ```
   
   so that we can avoid having each executor launch a long-running server.
   
https://docs.google.com/document/d/1qs8lKQ3IwF5QGGAaa6OIiXYhdG4_HJtS66dswtx9kd0/edit#bookmark=id.f6cwxc97g3ig
   
   Then we don't need these variables.
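
To make the start / read / shut down lifecycle concrete, here is a rough, self-contained sketch of how a per-UDF server guarded by a port and a secret could behave. It is not the PR's implementation: `start_chunk_server`, `read_chunk`, `shutdown_chunk_server`, and the in-memory `_CHUNKS` table are hypothetical stand-ins, and Python's standard `multiprocessing.connection` is used here in place of a real Arrow stream server.

```python
import secrets
import threading
from multiprocessing import AuthenticationError
from multiprocessing.connection import Client, Listener

# Hypothetical stand-in for chunk data cached on the executor.
_CHUNKS = {"chunk-0": b"arrow-batch-bytes-0", "chunk-1": b"arrow-batch-bytes-1"}


def start_chunk_server():
    """Start a local server on an ephemeral port and return (port, secret)."""
    secret = secrets.token_hex(16)
    # Port 0 asks the OS for a free port; authkey enables an HMAC handshake.
    listener = Listener(("127.0.0.1", 0), authkey=secret.encode())
    port = listener.address[1]

    def serve():
        running = True
        while running:
            try:
                conn = listener.accept()
            except AuthenticationError:
                continue  # caller did not know the secret; ignore it
            with conn:
                action, chunk_id = conn.recv()
                if action == "shutdown":
                    conn.send("ok")
                    running = False
                else:
                    conn.send(_CHUNKS.get(chunk_id))
        listener.close()

    threading.Thread(target=serve, daemon=True).start()
    return port, secret


def _request(message, port, secret):
    """Open an authenticated connection, send one request, return the reply."""
    with Client(("127.0.0.1", port), authkey=secret.encode()) as conn:
        conn.send(message)
        return conn.recv()


def read_chunk(chunk_id, port, secret):
    """Fetch one chunk; usable from any process that knows port and secret."""
    return _request(("read", chunk_id), port, secret)


def shutdown_chunk_server(port, secret):
    """Ask the server to stop so that nothing outlives the UDF."""
    _request(("shutdown", None), port, secret)


# The server lives only as long as the calling code needs it.
port, secret = start_chunk_server()
try:
    data = read_chunk("chunk-0", port, secret)
finally:
    shutdown_chunk_server(port, secret)
```

Because the port and secret are returned to the caller and passed explicitly to each call, nothing needs to be stored on a long-lived environment object in this sketch; the `try`/`finally` keeps the server's lifetime bounded by the code that started it.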



