BryanCutler commented on issue #24734: [SPARK-27870][SQL][PySpark] Flush each batch for pandas UDF (for improving pandas UDFs pipeline)

URL: https://github.com/apache/spark/pull/24734#issuecomment-497160693

> If a buffered writer is used, it doesn't flush data to output stream immediately. So the reader on the other side, even trying to read the data from the stream at all time, cannot see the data until the buffered writer decides to flush.

Right, with a buffered writer, the data is written out once the buffer is full. Both the Scala and Python writers write batches continuously in a loop, blocking only when the buffer is full and the other side hasn't read the previous data. If a single batch happens not to fill the write buffer, the next batch is written, and so on until the buffer is full. The only time the buffer might not be filled is on the final batch; at that point the stream is closed, which triggers a flush.

Are you thinking of the case where an entire batch does not completely fill the write buffer?
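The buffering behavior under discussion can be sketched in plain Python (this is a minimal illustration using the standard `io` module, not Spark's actual serializer code): a small write sits in the `BufferedWriter`'s buffer, invisible to a reader of the underlying stream, until an explicit `flush()` (or `close()`) pushes it through.

```python
import io

# Hypothetical stand-in for the worker's output stream: a BufferedWriter
# over an in-memory raw stream. Batch sizes and buffer size are made up
# for illustration.
raw = io.BytesIO()
buffered = io.BufferedWriter(raw, buffer_size=8192)  # 8 KiB buffer

batch = b"x" * 100            # a small "batch", far below buffer_size
buffered.write(batch)
print(len(raw.getvalue()))    # the reader's side still sees 0 bytes

buffered.flush()              # explicit flush pushes the batch through
print(len(raw.getvalue()))    # now the 100 bytes are visible
```

This is why flushing per batch matters when a batch is smaller than the buffer: without the explicit flush, the downstream reader blocks until either enough batches accumulate to fill the buffer or the stream is closed at the end.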
