Jefffrey opened a new issue, #20152:
URL: https://github.com/apache/datafusion/issues/20152

   ### Is your feature request related to a problem or challenge?
   
   
https://github.com/apache/datafusion/blob/b80bf2ca8ef74900fee96a1cc169bdedf53b36fc/datafusion/common/src/hash_utils.rs#L398-L444
   
   - Dictionary doesn't consider if there are 0 null keys + has rehash check 
inside the hotloop
   
   
https://github.com/apache/datafusion/blob/b80bf2ca8ef74900fee96a1cc169bdedf53b36fc/datafusion/common/src/hash_utils.rs#L455-L468
   
   - Struct collects the valid indices even if there are no nulls
   
   
https://github.com/apache/datafusion/blob/b80bf2ca8ef74900fee96a1cc169bdedf53b36fc/datafusion/common/src/hash_utils.rs#L487-L489
   
   
https://github.com/apache/datafusion/blob/b80bf2ca8ef74900fee96a1cc169bdedf53b36fc/datafusion/common/src/hash_utils.rs#L574-L576
   
   
https://github.com/apache/datafusion/blob/b80bf2ca8ef74900fee96a1cc169bdedf53b36fc/datafusion/common/src/hash_utils.rs#L643-L645
   
   - Map/ListView/FixedSizeList checks for existence of null buffer only, not 
for if there are nulls (null_count)
   
   
https://github.com/apache/datafusion/blob/b80bf2ca8ef74900fee96a1cc169bdedf53b36fc/datafusion/common/src/hash_utils.rs#L713-L715
   
   - Run array has rehash check inside hotloop
   
   
   ### Describe the solution you'd like
   
   - Always check for null path based on null_count; for dictionary + struct we 
should have separate paths for when there are nulls vs when there are no nulls 
(to be consistent with how the other functions handle this)
     - This may be a pedantic case (how often would we have an array with a 
null buffer but no nulls in it?), but can consider cases like having an array 
with nulls, but then slicing into a section of it that contains no null values
   - Pull rehash outside the hotloop
   
   ### Describe alternatives you've considered
   
   If there aren't noticeable performance improvements then might not be worth 
considering. Maybe branch prediction is good enough for having rehash check 
inside the hotloop since that wouldn't change per iteration 🤔 
   
   ### Additional context
   
   Can take inspiration from
   
   - https://github.com/apache/datafusion/pull/19374


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to