neilconway opened a new pull request, #22893:
URL: https://github.com/apache/datafusion/pull/22893

   ## Which issue does this PR close?
   
   - Closes #22875
   
   ## Rationale for this change
   
   Previously, `HashMap`-backed hash joins included NULLs but `ArrayMap`-backed 
hash joins omitted them. Under `NullEqualsNothing`, we can safely omit rows 
that have a NULL in any of their join keys, because they will never contribute 
to the output of the join. Omitting NULLs reduces the size of the build-side 
hash table.
   
   The previous probe behavior also searched the hash table for probe rows 
with NULLs in their join keys. This was wasted work; worse, because all NULL 
build rows end up in the same hash chain, it could be very expensive for joins 
over NULL-heavy data sets. For example, joining two 10k-row tables on all-NULL 
join keys took ~6 seconds (!). That drops to a few milliseconds after this PR.
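
As a rough illustration of the idea (not the code in this PR), the sketch below runs an inner equi-join on a single nullable key. It assumes `Option<i64>` keys and a `std::collections::HashMap`, whereas DataFusion works on Arrow arrays with its own join hash map; the function name `inner_join_skip_nulls` is hypothetical.

```rust
use std::collections::HashMap;

/// Inner equi-join on one nullable key under "NULL equals nothing" semantics.
/// Returns (build_row, probe_row) index pairs. Illustrative only.
fn inner_join_skip_nulls(
    build: &[Option<i64>],
    probe: &[Option<i64>],
) -> Vec<(usize, usize)> {
    // Build phase: rows with a NULL key are never inserted, so they cannot
    // pile up in a single long hash chain.
    let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
    for (build_row, key) in build.iter().enumerate() {
        if let Some(k) = key {
            table.entry(*k).or_default().push(build_row);
        }
    }

    // Probe phase: rows with a NULL key skip the lookup entirely, because
    // they cannot match any build row.
    let mut matches = Vec::new();
    for (probe_row, key) in probe.iter().enumerate() {
        if let Some(k) = key {
            if let Some(build_rows) = table.get(k) {
                for &build_row in build_rows {
                    matches.push((build_row, probe_row));
                }
            }
        }
    }
    matches
}
```

With all-NULL keys on both sides, the table stays empty and no lookups happen, which is the case the timing above describes.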
   
   ## What changes are included in this PR?
   
   * Omit build rows with one or more NULLs in their join keys from `HashMap`
   * Don't probe the map for probe rows with NULLs in their join keys
   * Fix a few places that assumed that an empty build-side hash table meant 
the build input was empty
   * Add unit tests
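
To see why an empty hash table no longer implies an empty build input, consider a join that preserves the build side (for example, a left outer join with the left input as the build side). The sketch below is a simplified assumption, not DataFusion's implementation: `BuildSide`, `build_side`, and `unmatched_build_rows` are hypothetical names, and keys are modeled as `Option<i64>`.

```rust
use std::collections::HashMap;

/// Hypothetical build-side state. The row count is tracked separately from
/// the hash table, because NULL-key rows are omitted from the table.
struct BuildSide {
    table: HashMap<i64, Vec<usize>>,
    num_rows: usize,
}

fn build_side(keys: &[Option<i64>]) -> BuildSide {
    let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
    for (row, key) in keys.iter().enumerate() {
        if let Some(k) = key {
            table.entry(*k).or_default().push(row);
        }
    }
    BuildSide { table, num_rows: keys.len() }
}

/// Build rows that a build-side-preserving outer join must still emit
/// (padded with NULLs). NULL-key build rows are always unmatched.
fn unmatched_build_rows(build: &BuildSide, probe: &[Option<i64>]) -> Vec<usize> {
    let mut matched = vec![false; build.num_rows];
    // `flatten` skips NULL probe keys, so they are never looked up.
    for key in probe.iter().flatten() {
        if let Some(rows) = build.table.get(key) {
            for &row in rows {
                matched[row] = true;
            }
        }
    }
    (0..build.num_rows).filter(|&row| !matched[row]).collect()
}
```

For an all-NULL build input, `table.is_empty()` is true while `num_rows` is nonzero, so any shortcut keyed on the table being empty would wrongly drop those rows.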
   
   ## Are these changes tested?
   
   Yes; new tests added.
   
   ## Are there any user-facing changes?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

