Dandandan commented on issue #4973: URL: https://github.com/apache/arrow-datafusion/issues/4973#issuecomment-1613589972
> > I did some further experiment in the bucketing to see if this is still the case today [branch](https://github.com/apache/arrow-datafusion/commits/bucketing). > > This is very close to the description in the above MonetDB paper and written about here https://www.cockroachlabs.com/blog/vectorized-hash-joiner/ but doesn't really improve join performance, even when creating buckets to avoid collisions > > @Dandandan does DF use SIMD in the hot paths of hash join? From my experiences LLVM auto-vectorization is pretty fragile and often times doesn't get triggered if the logic is a bit complex. I think even the `create_hashes` in aggregation doesn't use SIMD (I could be wrong there). @sunchao TBH I am not sure about whether everything generates optimal SIMD instructions, however in the hash join vectorization of `create_hashes`, vectorized equality checking (using plain arrow kernels) was definitely a big performance improvement compared to the earlier scalar-based approaches, this would probably be the case even without much use of SIMD. There are probably some possible improvements here and there wrt. generated code. `create_hashes` uses ahash internally, which is one of the faster algorithms - the hashing itself is usually not very expensive by now compared to other operations, it used to be. It just does the hashing + rehashing column-wise like in other systems. The above experiment is only about the conversion from using a `RawTable<(u64, u64))` to storing buckets in a `Vec<u64>` and having collision checking handled at the probing phase as discussed in the MonetDB paper. Maybe I'm a bit wrong here in my implementation (and the Photon whitepaper doesn't include enough details unfortunately) or something else could be optimized in order for it to be faster. Or it turns out that keeping a plain hashtable for values only might not be so bad at all(?) At least with the current approach I'm getting mixed results. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
