alamb commented on pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#issuecomment-895383017


   > If I recall, we will need to perform N x M comparisons where N is the 
number of rows in the batch and M the distinct number of items in a group, 
around here, roughly represented in for (row, hash) in 
batch_hashes.into_iter().enumerate() and the inner 
group_values.iter()....all(op).
   
   Yes, I think that is accurate. I would love to figure out how to use a 
vectorized approach.
   
   In #808 we have a hash table mapping (effectively) `[ScalarValue]` -> 
`(aggregate data)`, and we have input `[ArrayRef]`. The hash calculation is 
vectorized (so that is good), and then there is a loop that looks like
   
   ```rust
   for (hash, index) in ... {
     // look up entry in hash table for row `index`
     // check if the values at `[ArrayRefs]` @ `index` are the same as in the 
entry in the table (what this PR's code, `array_eq`, is used for)
     // ...
   }
   ```
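   To make the shape of that loop concrete, here is a minimal, self-contained sketch (hypothetical names; a `std::collections::HashMap` over a single `i64` group column stands in for the real hash table over `[ArrayRef]`s, and the map's equality check plays the role of `array_eq`):

   ```rust
   use std::collections::HashMap;

   // Row-at-a-time probe: for each row, look up (or create) the group entry.
   // Returns, for each distinct key, the row indices belonging to that group.
   fn group_indices(group_col: &[i64]) -> HashMap<i64, Vec<usize>> {
       let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
       for (index, key) in group_col.iter().enumerate() {
           // the lookup + key comparison happen here, one row at a time --
           // this is the part that is hard to vectorize
           table.entry(*key).or_default().push(index);
       }
       table
   }
   ```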
   
   In order to vectorize this calculation, I think we would have to somehow 
vectorize both the lookup in the table and the comparison.
   
   I suppose we could potentially copy the existing keys out of the hash 
entries (so they are contiguous) to do a vectorized comparison, but my intuition 
is that the cost of this copy would outweigh any gains from vectorization.
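   As a rough sketch of that trade-off (hypothetical, using plain `i64` slices rather than `ArrayRef`s): once the candidate keys have been copied out of the hash entries into a contiguous buffer, the comparison itself becomes a simple slice-wide loop that the compiler can auto-vectorize, but only after paying for the copy.

   ```rust
   // `stored_keys` are candidate keys copied out of the hash entries so they
   // are contiguous; `probe_keys` are the corresponding values from the batch.
   // Comparing two contiguous slices element-wise is trivially vectorizable.
   fn vectorized_eq(stored_keys: &[i64], probe_keys: &[i64]) -> Vec<bool> {
       stored_keys
           .iter()
           .zip(probe_keys)
           .map(|(a, b)| a == b)
           .collect()
   }
   ```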

