neilconway opened a new pull request, #22893: URL: https://github.com/apache/datafusion/pull/22893
## Which issue does this PR close?

- Closes #22875

## Rationale for this change

Previously, `HashMap`-backed hash joins included NULLs but `ArrayMap`-backed hash joins omitted them. Under `NullEqualsNothing`, we can safely omit rows that have a NULL in any of their join keys, because they will never contribute to the output of the join. Omitting NULLs reduces the size of the build-side hash table.

The previous probe behavior also searched the hash table for probe rows with NULLs in their join keys. This was wasted work; indeed, because all NULL build rows end up in the same hash chain, it could be very expensive for joins over NULL-heavy data sets. For example, joining two 10k-row tables on all-NULL join keys took ~6 seconds (!). That drops to a few milliseconds after this PR.

## What changes are included in this PR?

* Omit build rows with one or more NULLs in their join keys from `HashMap`
* Don't probe the map for probe rows with NULLs in their join keys
* Fix a few places that assumed that an empty build-side hash table meant the build input was empty
* Add unit tests

## Are these changes tested?

Yes; new tests added.

## Are there any user-facing changes?

No.
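The build-side and probe-side changes described above can be illustrated with a minimal, standalone sketch. This is not DataFusion's implementation: `build_map` and `probe` are hypothetical helpers, a single `Option<i64>` column stands in for the join keys (with `None` modelling NULL), and a plain `std::collections::HashMap` stands in for the join hash table. It assumes an inner equi-join under `NullEqualsNothing` semantics.

```rust
use std::collections::HashMap;

/// Hypothetical helper: map each non-NULL build key to the build rows holding it.
/// `None` models a SQL NULL in the join key.
fn build_map(build_keys: &[Option<i64>]) -> HashMap<i64, Vec<usize>> {
    let mut map: HashMap<i64, Vec<usize>> = HashMap::new();
    for (row, key) in build_keys.iter().enumerate() {
        // When NULL never equals anything, a NULL key cannot match,
        // so the row is left out of the table entirely.
        if let Some(k) = key {
            map.entry(*k).or_default().push(row);
        }
    }
    map
}

/// Hypothetical helper: return (build_row, probe_row) pairs for an inner equi-join.
fn probe(
    map: &HashMap<i64, Vec<usize>>,
    probe_keys: &[Option<i64>],
) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    for (probe_row, key) in probe_keys.iter().enumerate() {
        // NULL probe keys are skipped without searching the map at all.
        let Some(k) = key else { continue };
        if let Some(build_rows) = map.get(k) {
            for &build_row in build_rows {
                out.push((build_row, probe_row));
            }
        }
    }
    out
}
```

With this scheme, a build input whose keys are all NULL produces an empty map even though the input itself is not empty, so code that needs to know whether the build side had rows would have to track that separately from the map.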
