[GitHub] [spark] galv commented on pull request #34505: [SPARK-37228][SQL][PYTHON] Implement DataFrame.mapInArrow in Python

GitBox Thu, 11 Nov 2021 15:42:09 -0800


galv commented on pull request #34505:
URL: https://github.com/apache/spark/pull/34505#issuecomment-966703239



   There's something this PR has made me begin to ponder. The "stream" object 
used here is created by the socket library's makefile() API: 
https://docs.python.org/3/library/socket.html#socket.socket.makefile
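
As a rough illustration of what makefile() provides, here is a minimal sketch
that wraps both ends of a local socket pair in file objects. The function name
roundtrip_via_makefile is hypothetical and this is not the code from the PR; it
only shows the file-like layer Python puts on top of a socket, and it assumes a
payload small enough to fit in the socket buffer.

```python
import socket


def roundtrip_via_makefile(payload: bytes) -> bytes:
    """Send payload through a socket pair using makefile() wrappers."""
    left, right = socket.socketpair()
    try:
        # makefile() returns buffered file objects layered over the sockets.
        with left.makefile("wb") as writer, right.makefile("rb") as reader:
            writer.write(payload)
            writer.flush()  # push the buffered bytes into the socket
            return reader.read(len(payload))
    finally:
        left.close()
        right.close()
```

The data still passes through Python's buffer and the kernel's socket buffer, so
every byte is copied at least once on each side.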
   
   This means it is not a traditional file: Python socket objects do not expose 
file-style read() and write() methods, so makefile() is just a convenience 
wrapper provided by Python. If a pipe were used instead of a socket, it seems 
conceivable that Arrow data structures could be written to the pipe via the 
vmsplice() syscall, which would effectively give zero-copy movement of data from 
Python to the JVM executor (I believe the virtual memory pages simply get 
attached to the pipe's buffer inside the kernel). My understanding is that the 
Python worker.py process always runs on the same machine as the JVM executor, so 
this seems like a reasonable speedup to consider.
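
To make the idea concrete, here is a minimal sketch of calling vmsplice() from
Python through ctypes. Everything in it is an assumption for illustration: the
names IoVec, write_to_pipe and pipe_roundtrip are hypothetical, Python's
standard library has no vmsplice wrapper, and this is not how Spark moves data
today. The sketch falls back to a plain os.write() where vmsplice is missing or
fails (for example on non-Linux systems), and it assumes a payload smaller than
the pipe's capacity.

```python
import ctypes
import os


class IoVec(ctypes.Structure):
    """Mirror of the C `struct iovec` that vmsplice() takes."""
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]


def _load_vmsplice():
    """Return libc's vmsplice, or None where it is unavailable."""
    try:
        fn = ctypes.CDLL(None, use_errno=True).vmsplice
    except (OSError, AttributeError, TypeError):
        return None
    # ssize_t vmsplice(int fd, const struct iovec *iov,
    #                  size_t nr_segs, unsigned int flags);
    fn.argtypes = [ctypes.c_int, ctypes.POINTER(IoVec),
                   ctypes.c_size_t, ctypes.c_uint]
    fn.restype = ctypes.c_ssize_t
    return fn


_vmsplice = _load_vmsplice()


def write_to_pipe(fd: int, buf) -> int:
    """Attach the pages backing `buf` to the pipe, or copy as a fallback."""
    if _vmsplice is not None:
        iov = IoVec(ctypes.addressof(buf), len(buf))
        written = _vmsplice(fd, ctypes.byref(iov), 1, 0)
        if written >= 0:
            return written
    return os.write(fd, buf.raw)  # ordinary copying write


def pipe_roundtrip(payload: bytes) -> bytes:
    """Write payload into a pipe and read it back from the other end."""
    read_fd, write_fd = os.pipe()
    try:
        # `buf` must stay alive and unmodified until the reader has consumed
        # the data, because the pipe may reference these pages directly.
        buf = ctypes.create_string_buffer(payload, len(payload))
        written = write_to_pipe(write_fd, buf)
        return os.read(read_fd, written)
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

Note that only the write side avoids a copy here: the os.read() on the other end
still copies out of the pipe, so a consumer would need something like splice()
to stay zero-copy end to end. The buffer-lifetime constraint in the comment above
is also the main practical cost of this approach.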


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

