BryanCutler commented on issue #24734: [SPARK-27870][SQL][PySpark] Flush each batch for pandas UDF (for improving pandas UDFs pipeline)

URL: https://github.com/apache/spark/pull/24734#issuecomment-497160693

> If a buffered writer is used, it doesn't flush data to output stream immediately. So the reader on the other side, even trying to read the data from the stream at all time, cannot see the data until the buffered writer decides to flush.

Right, with a buffered writer, the data is written out once the buffer is full. Both the Scala and Python writers write batches continuously in a loop, blocking only when the buffer is full and the other side hasn't read the previous data. If a single batch happens not to fill the write buffer, the next batch is written, and so on until the buffer is full. The only time the buffer might not be filled is on the final batch; at that point the stream is closed, which triggers a flush.

Are you thinking of the case where an entire batch does not completely fill the write buffer?
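The buffering behavior under discussion can be sketched in plain Python (this is a minimal illustration using the standard `io` module, not Spark's actual serializer code): a small write sits in the `BufferedWriter`'s buffer, invisible to a reader of the underlying stream, until an explicit `flush()` (or `close()`) pushes it through.

```python
import io

# Hypothetical stand-in for the worker's output stream: a BufferedWriter
# over an in-memory raw stream. Batch sizes and buffer size are made up
# for illustration.
raw = io.BytesIO()
buffered = io.BufferedWriter(raw, buffer_size=8192)  # 8 KiB buffer

batch = b"x" * 100            # a small "batch", far below buffer_size
buffered.write(batch)
print(len(raw.getvalue()))    # the reader's side still sees 0 bytes

buffered.flush()              # explicit flush pushes the batch through
print(len(raw.getvalue()))    # now the 100 bytes are visible
```

This is why flushing per batch matters when a batch is smaller than the buffer: without the explicit flush, the downstream reader blocks until either enough batches accumulate to fill the buffer or the stream is closed at the end.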
