BryanCutler opened a new pull request #24070: [SPARK-23961][PYTHON] Fix error 
when toLocalIterator goes out of scope
URL: https://github.com/apache/spark/pull/24070
 
 
   ## What changes were proposed in this pull request?
   
   This fixes an error when a PySpark local iterator, for both RDD and 
DataFrames, goes out of scope and the connection is closed before fully 
consuming the iterator. The error occurs on the JVM in the serving thread, when 
Python closes the local socket while the JVM is writing to it. This usually 
happens when there is enough data to fill the socket read buffer, causing the 
write call to block.
   
   The change here introduces a protocol for PySpark local iterators that work 
as follows:
   
   1) The local socket connection is made when the iterator is created
   2) When iterating, Python first sends a request for partition data as a 
non-zero integer
   3) While the JVM local iterator over partitions has next, it triggers a job 
to collect the next partition
   4) The partition is sent to Python and read by the PySpark deserializer
   5) After sending the entire partition, an `END_OF_DATA_SECTION` is sent to 
Python which stops the deserializer and allows to make another request
   6) When the JVM gets a request from Python but has already consumed it's 
local iterator, it will close the socket
   7) When the PySpark local iterator is garbage-collected, it will read any 
remaining data from the current partition (this is data that has already been 
collected) and send a request of zero to tell the JVM to stop collection jobs 
and close the connection.
   
   Steps 1, 3, 4, 6 are the same as before. The other steps add synchronization 
to allow for a clean closing of the socket, with a small trade-off in 
performance for each partition. This is mainly because the JVM does not start 
collecting partition data until it receives a request to do so, where before it 
would eagerly write all data until the socket receive buffer is full.
   
   ## How was this patch tested?
   
   Added new unit tests for DataFrame `toLocalIterator` and tested not fully 
consuming the iterator. Manual tests with Python 2.7  and 3.6.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to