> Also, I don't understand why there are two versions of the hash table
> ("hashing32" and "hashing64" apparently). What's the rationale? How is
> the user meant to choose between them? Say a Substrait plan is being
> executed: which hashing variant is chosen and why?

It's not user-configurable.  The hash-join and hash-group-by always use the
32-bit variant.  The asof-join always uses the 64-bit variant.  I wouldn't
stress too much about the hash-join.  It is a very memory-intensive
operation, and my guess is that by the time you have enough keys to worry
about hash uniqueness you should probably be doing an out-of-core join
anyway.  The hash-join implementation is also fairly tolerant of duplicate
keys.  I believe our hash-join performance is unlikely to be the
bottleneck in most cases.
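
For a rough sense of scale (a back-of-the-envelope sketch of my own, not a
measurement of the actual implementation): with n distinct keys and a
uniform b-bit hash, the expected number of colliding key pairs is roughly
n(n-1)/2^(b+1).  A small standalone C++ snippet:

    #include <cmath>
    #include <cstdio>

    // Expected number of colliding key pairs under a uniform b-bit hash
    // (birthday approximation): n * (n - 1) / 2^(b + 1).
    double ExpectedCollisions(double n, int bits) {
      return n * (n - 1) / std::pow(2.0, bits + 1);
    }

    int main() {
      for (double n : {1e5, 1e6, 1e8}) {
        std::printf("n=%.0e  32-bit: %.3g  64-bit: %.3g\n", n,
                    ExpectedCollisions(n, 32), ExpectedCollisions(n, 64));
      }
      return 0;
    }

At a million keys that works out to on the order of a hundred colliding
pairs with 32 bits versus effectively zero with 64 bits, which is where the
"enough keys to worry" threshold above comes from.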

It might make more sense to use the 64-bit variant for the group-by, since
in that case we are normally storing only the hash-to-group-id table
itself.  Solid benchmarking would be needed regardless.
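
To illustrate what "only storing the hash-to-group-id table" means, here is
an illustrative sketch using `std::unordered_map` (the real group-by uses
its own hash-table implementation, and `AssignGroupIds` is a made-up helper
for this email, not an Arrow API):

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Illustrative only: map each key's 64-bit hash to a dense group id.
    std::vector<uint32_t> AssignGroupIds(const std::vector<uint64_t>& hashes) {
      std::unordered_map<uint64_t, uint32_t> hash_to_group_id;
      std::vector<uint32_t> group_ids;
      group_ids.reserve(hashes.size());
      for (uint64_t h : hashes) {
        // New hash: assign the next dense group id; otherwise reuse the
        // id already stored for this hash.
        auto [it, inserted] = hash_to_group_id.try_emplace(
            h, static_cast<uint32_t>(hash_to_group_id.size()));
        group_ids.push_back(it->second);
      }
      return group_ids;
    }

Note that a real group-by still has to compare the actual key values on a
hash hit; a sketch like this one trusts the hash alone, which is exactly
where collision probability bites.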

On Mon, Jul 24, 2023 at 1:19 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Hi,
>
> On 21/07/2023 at 15:58, Yaron Gvili wrote:
> > A first approach I found is using `Hashing32` and `Hashing64`. This
> > approach seems useful for hashing the fields that compose a key across
> > multiple rows when joining. However, it has a couple of drawbacks. One
> > drawback is that if the number of distinct keys is large (on the scale
> > of a million or so), the probability of hash collision may no longer be
> > acceptable for some applications, more so when using `Hashing32`. Another
> > drawback I noticed in my experiments is that the common `N/A` and `0`
> > integer values both hash to 0 and thus collide.
>
> Ouch... so if N/A does have the same hash value as a common non-null
> value (0), this should be fixed.
>
> Also, I don't understand why there are two versions of the hash table
> ("hashing32" and "hashing64" apparently). What's the rationale? How is
> the user meant to choose between them? Say a Substrait plan is being
> executed: which hashing variant is chosen and why?
>
> I don't think 32-bit hashing is a good idea when operating on large
> data. Unless the hash function is exceptionally good, you may get lots
> of hash collisions. It's nice to have a SIMD-accelerated hash table, but
> less so if access times degenerate to O(n)...
>
> So IMHO we should only have one hashing variant with a 64-bit output.
> And make sure it doesn't have trivial collisions on common data patterns
> (such as nulls and zeros, or clustered integer ranges).
>
> > A second approach I found is serializing the Arrow structures (possibly
> > by streaming) and hashing them using the functions in `util/hashing.h`.
> > I haven't yet looked into what properties these hash functions have,
> > except for the documented high performance. In particular, I don't know
> > whether they have unfortunate hash collisions and, more generally, what
> > the probability of hash collision is. I also don't know whether they are
> > designed for efficient use in the context of joining.
>
> Those hash functions shouldn't have unfortunate hash collisions, but they
> were not exercised on real-world data at the time. I have no idea whether
> they are efficient in the context of joining, as they were written much
> earlier than our joining implementation.
>
> Regards
>
> Antoine.
>
