BryanCutler commented on a change in pull request #24826:
[SPARK-27870][SQL][PYTHON] Add a runtime buffer size configuration for Pandas UDFs
URL: https://github.com/apache/spark/pull/24826#discussion_r292100236
 
 

 ##########
 File path: python/pyspark/daemon.py
 ##########
 @@ -54,8 +54,9 @@ def worker(sock, authenticated):
     # Read the socket using fdopen instead of socket.makefile() because the latter
     # seems to be very slow; note that we need to dup() the file descriptor because
     # otherwise writes also cause a seek that makes us miss data on the read side.
-    infile = os.fdopen(os.dup(sock.fileno()), "rb", 65536)
-    outfile = os.fdopen(os.dup(sock.fileno()), "wb", 65536)
+    buffer_size = int(os.environ.get("SPARK_BUFFER_SIZE", 65536))
+    infile = os.fdopen(os.dup(sock.fileno()), "rb", buffer_size)
+    outfile = os.fdopen(os.dup(sock.fileno()), "wb", buffer_size)
 
 Review comment:
   Isn't it possible for the worker to be reused after the conf has been set to a
   different buffer size? The new value would then not be used, because the socket
   is already open.
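
   To make the concern concrete, here is a minimal sketch, not the actual
   daemon.py code. It assumes the buffer size is read once when the streams are
   opened, as in the patched lines above. `open_streams`, `has_data`, `demo` and
   the `env` dict (a stand-in for `os.environ`) are hypothetical names used only
   for illustration.

```python
import os
import select
import socket


def open_streams(sock, env):
    """Mirror the patched lines: the buffer size is read once, at open time."""
    buffer_size = int(env.get("SPARK_BUFFER_SIZE", 65536))
    infile = os.fdopen(os.dup(sock.fileno()), "rb", buffer_size)
    outfile = os.fdopen(os.dup(sock.fileno()), "wb", buffer_size)
    return infile, outfile


def has_data(sock, timeout):
    """Return True if bytes are ready to read on sock within timeout seconds."""
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable)


def demo():
    worker_sock, peer = socket.socketpair()
    env = {"SPARK_BUFFER_SIZE": "65536"}  # hypothetical stand-in for os.environ
    payload = b"x" * 2048
    try:
        infile, outfile = open_streams(worker_sock, env)

        # The conf changes while the worker, and its open streams, is reused.
        env["SPARK_BUFFER_SIZE"] = "1024"
        outfile.write(payload)
        # Still buffered: the stream kept the 65536-byte buffer it was opened with.
        stale_sent = has_data(peer, 0.2)

        # Only streams opened after the change pick up the new size.
        infile2, outfile2 = open_streams(worker_sock, env)
        outfile2.write(payload)  # larger than the 1024-byte buffer, so written through
        fresh_sent = has_data(peer, 1.0)

        for f in (infile, outfile, infile2, outfile2):
            f.close()
    finally:
        worker_sock.close()
        peer.close()
    return stale_sent, fresh_sent


if __name__ == "__main__":
    print(demo())
```

   In this sketch the first write stays in the old 64 KiB buffer even though the
   setting has changed. A reused worker would behave the same way unless the
   streams are reopened.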

