Jefffrey opened a new issue, #20152: URL: https://github.com/apache/datafusion/issues/20152
### Is your feature request related to a problem or challenge? https://github.com/apache/datafusion/blob/b80bf2ca8ef74900fee96a1cc169bdedf53b36fc/datafusion/common/src/hash_utils.rs#L398-L444 - Dictionary doesn't consider if there are 0 null keys + has rehash check inside the hotloop https://github.com/apache/datafusion/blob/b80bf2ca8ef74900fee96a1cc169bdedf53b36fc/datafusion/common/src/hash_utils.rs#L455-L468 - Struct collects the valid indices even if there are no nulls https://github.com/apache/datafusion/blob/b80bf2ca8ef74900fee96a1cc169bdedf53b36fc/datafusion/common/src/hash_utils.rs#L487-L489 https://github.com/apache/datafusion/blob/b80bf2ca8ef74900fee96a1cc169bdedf53b36fc/datafusion/common/src/hash_utils.rs#L574-L576 https://github.com/apache/datafusion/blob/b80bf2ca8ef74900fee96a1cc169bdedf53b36fc/datafusion/common/src/hash_utils.rs#L643-L645 - Map/ListView/FixedSizeList checks for existence of null buffer only, not for if there are nulls (null_count) https://github.com/apache/datafusion/blob/b80bf2ca8ef74900fee96a1cc169bdedf53b36fc/datafusion/common/src/hash_utils.rs#L713-L715 - Run array has rehash check inside hotloop ### Describe the solution you'd like - Always check for null path based on null_count; for dictionary + struct we should have separate paths for when there are nulls vs when there are no nulls (to be consistent with how the other functions handle this) - This may be a pedantic case (how often would we have an array with a null buffer but no nulls in it?), but can consider cases like having an array with nulls, but then slicing into a section of it that contains no null values - Pull rehash outside the hotloop ### Describe alternatives you've considered If there aren't noticeable performance improvements then might not be worth considering. Maybe branch prediction is good enough for having rehash check inside the hotloop since that wouldn't change per iteration 🤔 ### Additional context Can take inspiration from - https://github.com/apache/datafusion/pull/19374 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
