sigmod edited a comment on pull request #32210:
URL: https://github.com/apache/spark/pull/32210#issuecomment-826500384


   > Per my knowledge I don't know any obviously efficient way to do random 
lookup join with spilled hash map. 
   > How do we minimize random disk read for spilled map? Happy to brainstorm 
more if you have some rough ideas.
   
   `Hybrid hash join` was designed to address this issue: 
   - https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
   - https://cs-people.bu.edu/mathan/reading-groups/papers-classics/join.pdf
    
   In short, the idea is not to spill an entire, huge hash map onto disk, but to combine 
in-memory hash join with data partitioning/spilling. It's a standard algorithm 
that has been implemented in many query engines. Similarly, you could implement 
hash aggregation the same way, which should also be more efficient than the 
runtime fallback approach. 
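   To make the idea concrete, here's a minimal sketch of hybrid hash join in plain Java. Everything here is illustrative, not Spark internals: rows are `int[]{key, value}` pairs, the partition count is fixed, and the "spill files" are just in-memory lists standing in for disk. Partition 0 of the build side stays in a hash table; other partitions spill. Probe rows in partition 0 join immediately, the rest spill next to their build partition and are joined one partition at a time afterwards, so no random disk lookups are needed.

```java
import java.util.*;

// Hypothetical hybrid hash join sketch (names are illustrative, not Spark APIs).
public class HybridHashJoin {
    static final int NUM_PARTITIONS = 4;

    static int partitionOf(int key) {
        return Math.floorMod(Integer.hashCode(key), NUM_PARTITIONS);
    }

    // Rows are int[]{key, value}; output rows are int[]{key, buildVal, probeVal}.
    public static List<int[]> join(List<int[]> build, List<int[]> probe) {
        // Build phase: partition 0 goes into an in-memory multimap, others "spill".
        Map<Integer, List<int[]>> inMemory = new HashMap<>();
        List<List<int[]>> spilledBuild = new ArrayList<>();
        List<List<int[]>> spilledProbe = new ArrayList<>();
        for (int p = 0; p < NUM_PARTITIONS; p++) {
            spilledBuild.add(new ArrayList<>());   // stand-ins for spill files on disk
            spilledProbe.add(new ArrayList<>());
        }
        for (int[] row : build) {
            int p = partitionOf(row[0]);
            if (p == 0) inMemory.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
            else spilledBuild.get(p).add(row);
        }

        // Probe phase: partition-0 rows join immediately; others spill for later.
        List<int[]> out = new ArrayList<>();
        for (int[] row : probe) {
            int p = partitionOf(row[0]);
            if (p == 0) {
                for (int[] b : inMemory.getOrDefault(row[0], List.of()))
                    out.add(new int[]{row[0], b[1], row[1]});
            } else {
                spilledProbe.get(p).add(row);
            }
        }

        // Cleanup phase: re-read each spilled partition pair sequentially and
        // hash-join it fully in memory, one partition at a time.
        for (int p = 1; p < NUM_PARTITIONS; p++) {
            Map<Integer, List<int[]>> table = new HashMap<>();
            for (int[] b : spilledBuild.get(p))
                table.computeIfAbsent(b[0], k -> new ArrayList<>()).add(b);
            for (int[] r : spilledProbe.get(p))
                for (int[] b : table.getOrDefault(r[0], List.of()))
                    out.add(new int[]{r[0], b[1], r[1]});
        }
        return out;
    }
}
```

   The key property is that spilled data is only ever read back sequentially, partition by partition, which is what avoids the random-disk-read problem raised above. A real implementation would size the in-memory fraction from the available memory budget and recursively re-partition any partition that still doesn't fit.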
   
   IMO, either (1) AQE or (2) more conservative static QO decisions (e.g., planning 
shuffled hash join less aggressively) might be low-hanging fruit, if they 
can address the issue. But if indeed neither works for some of your 
queries, supporting spilling could be a viable option. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


