Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/8662#issuecomment-141158021
  
    @rxin The reason CachedOnceRDD looks a little more complicated than expected 
is that the order of the two callers (zip and the writer thread) is undefined, 
because they run in separate threads. Usually zip will lag behind the writer 
thread, but zip could call compute() before the writer thread does. This 
approach can avoid the deadlock without flushing in the PythonRDD pipeline.
    
    The number of buffered rows is bounded by the number of rows in PythonRDD, 
so we could skip spilling here (there is still a chance of OOM, but it is small).
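    
    To illustrate the idea only, here is a minimal Python sketch. It is not 
the actual CachedOnceRDD code: the class name ComputeOnce, its methods, and the 
two-consumer limit are assumptions made for this example. Two callers may ask 
for an iterator in either order and from different threads. Whichever caller 
is ahead pulls from the upstream and buffers the row for the other, so the 
upstream is computed once and nobody blocks waiting for the other side.

```python
import threading
from collections import deque


class ComputeOnce:
    """Hypothetical helper: share one upstream iterator between two consumers."""

    def __init__(self, upstream):
        self._upstream = iter(upstream)
        self._lock = threading.Lock()
        # One pending-row buffer per consumer, created up front so that
        # either consumer may start first.
        self._buffers = [deque(), deque()]
        self._claimed = 0

    def compute(self):
        """Return an iterator for the calling consumer (any thread, any order)."""
        with self._lock:
            if self._claimed >= 2:
                raise RuntimeError("only two consumers are supported")
            me = self._claimed
            self._claimed += 1
        return self._iterate(me)

    def _iterate(self, me):
        other = 1 - me
        while True:
            with self._lock:
                if self._buffers[me]:
                    # The other consumer is ahead: replay its buffered row.
                    row = self._buffers[me].popleft()
                else:
                    # This consumer is ahead: pull from the upstream once
                    # and remember the row for the other consumer.
                    try:
                        row = next(self._upstream)
                    except StopIteration:
                        return
                    self._buffers[other].append(row)
            yield row
```

    In this sketch the buffered rows are bounded by how far one consumer runs 
ahead of the other, and nothing is spilled to disk, so a large gap could still 
exhaust memory.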
    
    For 1.5.1, the goals are:
    1) compute the upstream once
    2) no deadlock
    3) small risk (fewer changes in PythonRDD)
    4) not much performance regression
    
    Which one should we pick for 1.5.1?


