Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/9241#issuecomment-151939041
  
    @JoshRosen This PR is already huge, i will stop to add new things (but will 
fix any problems), please do another round review, thanks!
    
    For thread-safety, here what I'm got:
    
    1) Without calling spill(), the operators should only be used by single 
thread, no safety problems.
    
    2) spill() could be triggered in two cases, triggered by itself, or by 
other operators. we can check `trigger ==  this` in `spill()`, so it's still in 
the same thread, so safety problems.
    
    3) if it's triggered by other operators (right now cache will not trigger 
spill()), we only spill the data into disk when it's in scanning stage 
(building is finished), so the in-memory sorter or memory pages are read-only, 
we only need to synchronize the iterator and change it.
    
    4) During scanning, the iterator will only use one record in one page, we 
can't free this page, because the downstream is currently using it (used by 
UnsafeRow or other objects). In BytesToBytesMap, we just skip the current page, 
and dump all others into disk. In UnsafeExternalSorter, we keep the page that 
is used by current record (having the same baseObject), free it when loading 
the next record. In ShuffleExternalSorter, the spill() will not trigger during 
scanning.
    
    5) In order to avoid deadlock, we didn't call acquireMemory during spill 
(so we reused the pointer array in InMemorySorter). 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to