[GitHub] [arrow-datafusion] alamb commented on issue #1456: The Eq method in HashAggregate takes up a lot of time, how to optimize it

GitBox Thu, 16 Dec 2021 13:23:10 -0800


alamb commented on issue #1456:
URL: https://github.com/apache/arrow-datafusion/issues/1456#issuecomment-996204065



   I think `eq_array` is a symptom rather than the root cause.
   
   `eq_array` is necessary in the current hash aggregate implementation to detect hash collisions:
   
   ```
                                               ┌──────────────────────────────────────┐
                   ┌─────────┐                 │Bucket                                │
                   │         │                 │(                                     │
                   │HashTable│        ┌───────▶│ grp_key1: ScalarValue                │
                   │         │────────┘        │ grp_key2: ScalarValue                │
                   │         │                 │)                                     │
                   │         │                 └──────────────────────────────────────┘
                   └─────────┘


     ┌─────────────┬─────────────┐
     │Group Column │Group Column │
     │      A      │      B      │             Step 1: hash(grp_key1, grp_key2) is
     └─────────────┴─────────────┘                     computed (vectorized)
           ...           ...
     ┌─────────────┬─────────────┐             Step 2: bucket for that hash value is
     │  grp_key1   │  grp_key2   │                     obtained
     └─────────────┴─────────────┘
           ...           ...                   Step 3: validate that the values stored in
                                                       the bucket are the same as the input key
                                                       (aka that there are no hash collisions)


           eq_array is used for step 3
   ```
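
The three steps in the diagram can be sketched roughly as follows. This is a standalone illustration and not the DataFusion code: the `Scalar` enum, `Group`, `hash_row` and `group_count` are hypothetical names, the hashing is done per row rather than vectorized, and the aggregate is a plain row count. The `==` on the stored keys plays the role that `eq_array` plays in step 3.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a dynamically typed group key value.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Scalar {
    Int64(i64),
    Utf8(String),
}

/// One group: the stored key values plus a running row count.
struct Group {
    keys: Vec<Scalar>,
    count: u64,
}

/// Step 1: hash all group key values of one row (not vectorized here).
fn hash_row(row: &[Scalar]) -> u64 {
    let mut hasher = DefaultHasher::new();
    row.hash(&mut hasher);
    hasher.finish()
}

/// Groups `rows` by their key values and counts the rows in each group.
/// `hash_fn` is a parameter only so that collisions can be forced in tests.
fn group_count(rows: &[Vec<Scalar>], hash_fn: fn(&[Scalar]) -> u64) -> Vec<Group> {
    let mut groups: Vec<Group> = Vec::new();
    // hash value -> indices into `groups` of every group with that hash
    let mut table: HashMap<u64, Vec<usize>> = HashMap::new();

    for row in rows {
        // Step 1: compute the hash of the group keys.
        let hash = hash_fn(row.as_slice());
        // Step 2: obtain the bucket for that hash value.
        let bucket = table.entry(hash).or_default();
        // Step 3: compare the stored keys with the input keys, because two
        // different keys may share a hash. Every comparison dispatches on
        // the enum variant, which is the per-value overhead being discussed.
        let found = bucket.iter().copied().find(|&i| groups[i].keys == *row);
        match found {
            Some(i) => groups[i].count += 1,
            None => {
                bucket.push(groups.len());
                groups.push(Group { keys: row.clone(), count: 1 });
            }
        }
    }
    groups
}
```

Passing a hash function that always returns the same value sends every row to one bucket, which shows why step 3 cannot simply be dropped: without the key comparison, distinct groups would be merged.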
   
   I am pretty sure this code is correct, though since it is general purpose (works for all types) there is non-trivial dispatch overhead.
   
   If you are trying to speed up a distinct aggregate calculation, I suggest you look into special-casing group keys which are native types and which can be packed into fixed-length byte arrays (so they can be compared using memory comparisons rather than dispatching on each column).
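
To make the suggestion concrete, here is a minimal sketch of packing native group keys into a fixed-length byte array. It assumes exactly two group columns of types `i64` and `u32` with no nulls; `KEY_WIDTH`, `pack_key` and `count_groups` are hypothetical names, and nothing here is taken from the DataFusion code base.

```rust
use std::collections::HashMap;

/// Width of one packed key: 8 bytes for the i64 column + 4 for the u32 column.
const KEY_WIDTH: usize = 12;

/// Packs one row of two native group columns into a fixed-length byte key.
/// Nulls are not handled; a real implementation would need extra bytes
/// (or a bitmap) to tell a null apart from a valid value.
fn pack_key(a: i64, b: u32) -> [u8; KEY_WIDTH] {
    let mut key = [0u8; KEY_WIDTH];
    key[..8].copy_from_slice(&a.to_le_bytes());
    key[8..].copy_from_slice(&b.to_le_bytes());
    key
}

/// Counts rows per distinct (a, b) pair using the packed key directly as the
/// hash map key. Equality between keys is a comparison of 12 bytes, with no
/// dispatch on the type of each column.
fn count_groups(col_a: &[i64], col_b: &[u32]) -> HashMap<[u8; KEY_WIDTH], u64> {
    assert_eq!(col_a.len(), col_b.len(), "group columns must have equal length");
    let mut counts = HashMap::new();
    for (&a, &b) in col_a.iter().zip(col_b) {
        *counts.entry(pack_key(a, b)).or_insert(0u64) += 1;
    }
    counts
}
```

The hash map still checks key equality on lookup, but that check now compares two byte arrays of known length instead of walking a list of dynamically typed values.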
   
   Another way of saying this is "don't try to remove `eq_array`, but instead try to remove the use of `ScalarValue` entirely".
   


