[GitHub] [spark] sigmod edited a comment on pull request #32210: [SPARK-32634][SQL] Introduce sort-based fallback for shuffled hash join (non-code-gen path)

GitBox Thu, 03 Jun 2021 23:39:47 -0700


sigmod edited a comment on pull request #32210:
URL: https://github.com/apache/spark/pull/32210#issuecomment-854403695



   > > > Is 3.1 + 3.2 correct? Say the build side has two rows, r1 
and r2, with identical values on the join key columns. r1 is processed before 
the OOM and inserted into the in-memory hash table, while r2 is processed after 
the OOM. Now a probe side row p with the same join key values can only 
match r1. For correctness, p should match r2 as well.
   > > 
   > > 
   > > @sigmod - yes, you are right. Sorry for missing it. Updated above. For a 
non-semi join, we need to put the rows from the stream side into the sorter 
anyway, thanks.
   > 
   > Ok, it sounds like a working join algorithm that at least does not error 
out a query because of OOM.
   > 
   > But the proposal is less efficient compared to hybrid hash join in a few 
aspects:
   > 
   > * (1) HHJ doesn't have to spill all probe rows, because the build side has 
been partitioned - a build side partition is either fully memory-resident or 
fully spilled, and the same holds for the corresponding probe rows;
   > * (2) It's not friendly to the case where the build side goes just slightly 
beyond RAM. E.g., say the in-memory hash table's size is H before the OOM, the 
build side's total size is H + h, and the probe side's size is P. If P >> H >> h, 
meaning we go just slightly beyond RAM size, we still have to sort the entire 
probe side.
   > 
   > Given that it's not worse than what the existing HJ does - running slowly 
is better than an error, I'm fine with merging this new proposal :-)
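
To make the r1/r2 concern above concrete, here is a toy sketch of the fallback as I read it from this thread; it is not the PR's code. The `capacity` parameter is a hypothetical stand-in for the point where the OOM happens, and rows are plain `(key, value)` tuples. Build rows seen before that point go into the hash table, later ones go to a sorted overflow run, and every probe row both probes the hash table and takes part in the merge against the overflow run.

```python
from collections import defaultdict


def fallback_inner_join(build, probe, capacity):
    """Toy inner join with a sort-based fallback.

    build, probe: lists of (key, value) tuples.
    capacity: hypothetical number of build rows that fit before the "OOM".
    """
    table = defaultdict(list)   # build rows processed before the OOM
    overflow = []               # build rows processed after the OOM
    for idx, (key, value) in enumerate(build):
        if idx < capacity:
            table[key].append(value)
        else:
            overflow.append((key, value))

    out = []
    # Phase 1: every probe row probes the in-memory hash table.
    for key, pval in probe:
        for bval in table.get(key, []):
            out.append((key, pval, bval))

    # Phase 2: all probe rows are sorted and merged with the overflow run,
    # so a probe row can still match build rows that arrived after the OOM.
    overflow.sort(key=lambda row: row[0])
    i = 0
    for key, pval in sorted(probe, key=lambda row: row[0]):
        while i < len(overflow) and overflow[i][0] < key:
            i += 1
        j = i
        while j < len(overflow) and overflow[j][0] == key:
            out.append((key, pval, overflow[j][1]))
            j += 1
    return out
```

With `build = [(1, "r1"), (1, "r2")]`, `probe = [(1, "p")]` and `capacity = 1`, p pairs with r1 in phase 1 and with r2 in phase 2. Note that phase 2 sorts the whole probe side even when the overflow run is tiny, which is the cost described in point (2).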
   
   PS, there's some additional complexity in getting left/full-outer joins done 
right (assuming the probe side is on the left) -- for a LOJ/FOJ, it seems that 
you need to tag each probe side row with "matched" vs. "not-matched-yet", and 
then use this information in the customized merge join logic. 
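
A minimal sketch of that tagging for the left-outer case, under the same toy assumptions (a hypothetical `capacity` standing in for the OOM point, `(key, value)` tuples, non-null keys). It is only meant to show where the flag is set and read, not how the PR implements it. Each probe row carries a `matched` flag out of the hash phase, and the merge phase emits the null-padded row only when neither phase found a match.

```python
from collections import defaultdict


def fallback_left_outer_join(build, probe, capacity):
    """Toy left outer join (probe side on the left) with a sort-based fallback."""
    table = defaultdict(list)   # build rows processed before the "OOM"
    overflow = []               # build rows processed after the "OOM"
    for idx, (key, value) in enumerate(build):
        if idx < capacity:
            table[key].append(value)
        else:
            overflow.append((key, value))

    out = []
    tagged = []  # (key, probe_value, matched_in_hash_phase)
    for key, pval in probe:
        hits = table.get(key, [])
        for bval in hits:
            out.append((key, pval, bval))
        # Do not emit a null-padded row yet: the overflow run may still match.
        tagged.append((key, pval, bool(hits)))

    overflow.sort(key=lambda row: row[0])
    tagged.sort(key=lambda row: row[0])
    i = 0
    for key, pval, matched in tagged:
        while i < len(overflow) and overflow[i][0] < key:
            i += 1
        j = i
        while j < len(overflow) and overflow[j][0] == key:
            out.append((key, pval, overflow[j][1]))
            matched = True
            j += 1
        if not matched:
            # Unmatched in both phases: emit the row padded with None.
            out.append((key, pval, None))
    return out
```

Without the flag, a probe row that matched only in the hash table would look unmatched to the merge logic and would wrongly produce an extra null-padded row. A full outer join would additionally need to track which build rows were matched, which this sketch leaves out.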
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
