holdenk commented on issue #24070: [SPARK-23961][PYTHON] Fix error when 
toLocalIterator goes out of scope
URL: https://github.com/apache/spark/pull/24070#issuecomment-478645478
 
 
   So before I jump into the details of this PR, from the design side I've got 
some questions I'd like to try and understand better:
   
   1) This description mentions that it may save "unneeded work" but not 
eagerly queueing jobs that aren't needed, but in the situation where the jobs 
are needed it seems like we're adding unnecessary synchronization that could 
slow this down.
   - Have you tried the benchmarks with a slow per-partition computation to see 
if there is more impact there?
   - Is this slow down "worth it", 20% on DataFrame is nothing to walk away 
from if we can avoid it.
   - If we're worried about the driver program falling over from buffering  too 
much data in the JVM can we maybe queue the next partition while we are serving 
the current one to decrease the overhead?
   
   2) Are we solving this problem on the "right" side? Is it common for the the 
localiterator to go out of scope before being consumed?
   - If it is is a relatively infrequent occurrence, would it possible make 
sense to just make the Java code accept that the Python socket may go away? I 
think that might keep the "happy" (or more frequent) path at reasonable 
performance. On the other hand if folks are using toLocalIterator to partially 
consume the data and this is the expected path, then using exception handling 
might not make sense.
   
   I'm going to dig into the code, but I'd love to get a better understanding 
of the design as well.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to