> Also, I don't understand why there are two versions of the hash table
> ("hashing32" and "hashing64" apparently). What's the rationale? How is
> the user meant to choose between them? Say a Substrait plan is being
> executed: which hashing variant is chosen and why?
It's not user-configurable. The hash-join and hash-group-by always use the 32-bit variant. The asof-join always uses the 64-bit variant.

I wouldn't stress too much about the hash-join. It is a very memory-intensive operation and my guess is that by the time you have enough keys to worry about hash uniqueness you should probably be doing an out-of-core join anyway. The hash-join implementation is also fairly tolerant of duplicate keys, and I believe our hash-join performance is unlikely to be the bottleneck in most cases.

It might make more sense to use the 64-bit variant for the group-by, as we are normally only storing the hash-to-group-id table itself in those cases. Solid benchmarking would probably be needed regardless.

On Mon, Jul 24, 2023 at 1:19 AM Antoine Pitrou <anto...@python.org> wrote:

> Hi,
>
> Le 21/07/2023 à 15:58, Yaron Gvili a écrit :
> > A first approach I found is using `Hashing32` and `Hashing64`. This
> > approach seems to be useful for hashing the fields composing a key of
> > multiple rows when joining. However, it has a couple of drawbacks. One
> > drawback is that if the number of distinct keys is large (like in the
> > scale of a million or so) then the probability of hash collision may no
> > longer be acceptable for some applications, more so when using
> > `Hashing32`. Another drawback that I noticed in my experiments is that
> > the common `N/A` and `0` integer values both hash to 0 and thus collide.
>
> Ouch... so if N/A does have the same hash value as a common non-null
> value (0), this should be fixed.
>
> Also, I don't understand why there are two versions of the hash table
> ("hashing32" and "hashing64" apparently). What's the rationale? How is
> the user meant to choose between them? Say a Substrait plan is being
> executed: which hashing variant is chosen and why?
>
> I don't think 32-bit hashing is a good idea when operating on large
> data. Unless the hash function is exceptionally good, you may get lots
> of hash collisions. It's nice to have a SIMD-accelerated hash table, but
> less so if access times degenerate to O(n)...
>
> So IMHO we should only have one hashing variant with a 64-bit output.
> And make sure it doesn't have trivial collisions on common data patterns
> (such as nulls and zeros, or clustered integer ranges).
>
> > A second approach I found is by serializing the Arrow structures
> > (possibly by streaming) and hashing using functions in `util/hashing.h`.
> > I didn't yet look into what properties these hash functions have except
> > for the documented high performance. In particular, I don't know whether
> > they have unfortunate hash collisions and, more generally, what is the
> > probability of hash collision. I also don't know whether they are
> > designed for efficient use in the context of joining.
>
> Those hash functions shouldn't have unfortunate hash collisions, but they
> were not exercised on real-world data at the time. I have no idea whether
> they are efficient in the context of joining, as they have been written
> much earlier than our joining implementation.
>
> Regards
>
> Antoine.
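For what it's worth, here is a quick back-of-the-envelope birthday-bound calculation (a minimal sketch, not Arrow code, assuming an idealized uniform hash) to put rough numbers on the "million distinct keys" concern quoted above:

```cpp
// Birthday-bound estimate, assuming an ideal uniform hash (an assumption,
// not a property of Arrow's Hashing32/Hashing64 implementations):
//   P(at least one collision among n keys into 2^bits buckets)
//     ~= 1 - exp(-n*(n-1) / (2 * 2^bits))
#include <cmath>
#include <cstdio>

double collision_probability(double n, int bits) {
  const double buckets = std::ldexp(1.0, bits);  // 2^bits
  return 1.0 - std::exp(-n * (n - 1.0) / (2.0 * buckets));
}

double expected_colliding_pairs(double n, int bits) {
  return n * (n - 1.0) / (2.0 * std::ldexp(1.0, bits));
}

int main() {
  const double n = 1e6;  // ~a million distinct keys, as in the scenario above
  // 32-bit: collision probability ~1.0, on the order of 100 colliding pairs expected.
  std::printf("32-bit: P=%.6f, expected pairs=%.1f\n",
              collision_probability(n, 32), expected_colliding_pairs(n, 32));
  // 64-bit: collision probability ~2.7e-8.
  std::printf("64-bit: P=%.3g, expected pairs=%.3g\n",
              collision_probability(n, 64), expected_colliding_pairs(n, 64));
  return 0;
}
```

Under that (optimistic) uniformity assumption, a million distinct keys make collisions essentially certain with a 32-bit hash but vanishingly unlikely with a 64-bit one, which is consistent with the point above that it matters less for the collision-tolerant hash-join than for the group-by.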