galv commented on pull request #34505:
URL: https://github.com/apache/spark/pull/34505#issuecomment-966703239
There's something this PR has made me begin to ponder. The "stream" object is created by the socket library's makefile() API: https://docs.python.org/3/library/socket.html#socket.socket.makefile

This means it is not a traditional file (i.e., the BSD socket API does not support POSIX read and write, so this is just a convenience provided by Python). If a pipe were used instead of a socket, it seems conceivable that Arrow data structures could be written to the pipe via the vmsplice() syscall, which would effectively give zero-copy movement of data from Python to the JVM executor (I believe the virtual memory pages simply get assigned to the pipe file descriptor inside the kernel). My understanding is that the Python worker.py process is always on the same machine as the JVM executor, so this seems like a reasonable speedup to consider.
