BryanCutler commented on issue #24070: [SPARK-23961][PYTHON] Fix error when 
toLocalIterator goes out of scope
URL: https://github.com/apache/spark/pull/24070#issuecomment-480104456
 
 
   > Are we solving this problem on the "right" side? Is it common for the the 
localiterator to go out of scope before being consumed?
   If it is is a relatively infrequent occurrence, would it possible make sense 
to just make the Java code accept that the Python socket may go away?
   
   Thanks @holdenk , I totally agree.  I started off with a fix that would 
catch the error on the Java side and just allow the connection to be broken 
without showing the error.  Let me summarize the trade-offs that I noticed:
   
   **Catch/Ignore error in Java:**
   - No change in the protocol to Python, keeps things the same
   - This is the fastest way to iterate through all elements
   - Ignoring the error _might_ hide an actual error, but this is over a local 
socket so once the connection is made I'm not sure how likely an error is
   - While Python is still consuming data from one partition, other jobs will 
be started to collect the next partitions until write to the socket is blocked 
again.  This is nice if all the data is used, but bad if it's not.
   
   **Synchronized Protocol (this PR):**
   - Allows for the iterator to go out of scope and close the connection 
cleanly, without needing to catch/ignore errors.
   - Jobs are only triggered when the next element is needed. So only after the 
last element of  a partition is consumed, the next partition is collected. This 
matches the behavior of Scala toLocalIterator.
   - Performance is a little slower because of synchronization and no longer 
eagerly collecting the next partition
   - Makes the communication between Python more complicated
   
   I went with the second option here because I definitely didn't want to mask 
potential real errors, and only triggering jobs for data that is requested, 
matches Scala and is good for some use cases. Also while we don't want 
performance to suck, I think people would use toLocalIterator to work with 
constrained memory (1 partition at a time) and do a regular collect if they 
want good performance.
   
   Btw, from the JIRA for this, the user was calling `itertools.islice(..)` to 
take a slice of the the local iterator, which wasn't fully consuming it. I 
would agree it is probably more common to consume the entire iterator though.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to