[GitHub] [arrow-datafusion] Dandandan commented on issue #4973: Improve the performance of `Aggregator`, grouping, aggregation

via GitHub Thu, 29 Jun 2023 11:08:19 -0700


Dandandan commented on issue #4973:
URL: 
https://github.com/apache/arrow-datafusion/issues/4973#issuecomment-1613589972


   > > I did some further experiment in the bucketing to see if this is still 
the case today 
[branch](https://github.com/apache/arrow-datafusion/commits/bucketing).
   > > This is very close to the description in the above MonetDB paper and 
written about here https://www.cockroachlabs.com/blog/vectorized-hash-joiner/ 
but doesn't really improve join performance, even when creating buckets to 
avoid collisions
   > 
   > @Dandandan does DF use SIMD in the hot paths of hash join? From my 
experiences LLVM auto-vectorization is pretty fragile and often times doesn't 
get triggered if the logic is a bit complex. I think even the `create_hashes` 
in aggregation doesn't use SIMD (I could be wrong there).
   
   @sunchao TBH I am not sure about whether everything generates optimal SIMD 
instructions, however in the hash join vectorization of `create_hashes`, 
vectorized equality checking (using plain arrow kernels) was definitely a big 
performance improvement compared to the earlier scalar-based approaches, this 
would probably be the case even without much use of SIMD. There are probably 
some possible improvements here and there wrt. generated code. `create_hashes` 
uses ahash internally, which is one of the faster algorithms - the hashing 
itself is usually not very expensive by now compared to other operations, it 
used to be. It just does the hashing + rehashing column-wise like in other 
systems.
   
   The above experiment is only about the conversion from using a 
`RawTable<(u64, u64))` to storing buckets in a `Vec<u64>` and having collision 
checking handled at the probing phase as discussed in the MonetDB paper. Maybe 
I'm a bit wrong here in my implementation (and the Photon whitepaper doesn't 
include enough details unfortunately) or something else could be optimized in 
order for it to be faster. Or it turns out that keeping a plain hashtable for 
values only might not be so bad at all(?) At least with the current approach 
I'm getting mixed results.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] Dandandan commented on issue #4973: Improve the performance of `Aggregator`, grouping, aggregation

Reply via email to