[HACKERS] Hash Functions

Jeff Davis Thu, 11 May 2017 21:09:07 -0700

https://www.postgresql.org/message-id/camp0ubeo3fzzefie1vmc1ajkkrpxlnzqooaseu6cccco-c+...@mail.gmail.com


In that thread, I pointed out some important considerations for the
hash functions themselves. This is a follow-up, after I looked more
carefully.

1. The hash functions as they exist today aren't portable -- they can
return different results on different machines. That means using these
functions for hash partitioning would yield different contents for the
same partition on different architectures (and that's bad, considering
they are logical partitions and not some internal detail).

The core hashing algorithm is based on 32-bit chunks, so it's only
portable if the input is an int32 (or array of int32s). That's not
true for varchar (etc.) or numeric (which is based on an array of
int16s). There is a hack for int8, but see issue #2 below.

We could try to mark portable vs. unportable hash functions, but it
seems quite valuable to be able to logically partition on varchar, so
I think we should have some portable answer there. Another annoyance
is that we would have to assume container types are unportable, or
make them inherit the portability of the underlying type's hash
function.

We could revert 26043592 (copied Tom because that was his patch) to
make hash_any() go back to being portable -- do we know what that
speedup actually was? Maybe the benefit is smaller on newer
processors? Another option is to try to do some combination of
byteswapping and word-at-a-time, which might be better than
byte-at-a-time if the byteswapping is done with a native instruction.

2. hashint8() is a little dubious. If the input is positive, it XORs
the high bits; if negative, it XORs the complement of the high bits.
That works for compatibility with hashint2/4, but it does not cause
good hash properties[1]. I prefer that we either (a) upcast int2/4 to
int8 before hashing and then hash the whole 64 bits; or (b) if the
value is within range, downcast to int4, otherwise hash the whole 64
bits.

3. We should downcast numeric to int4/8 before hashing if it's in
range, so that it can be a part of the same opfamily as int2/4/8.

4. text/varchar should use hashbpchar() so that they can be part of
the same opfamily. Trailing blanks seem unlikely to be significant for
most real-world data anyway.

5. For salts[2], I don't think it's too hard to support them in an
optional way. We just allow the function to be a two-argument function
with a default. Places that care about specifying the salt may do so
if the function has pronargs==2, otherwise it just gets the default
value. If we have salts, I don't think having 64-bit hashes is very
important. If we run out of bits, we can just salt the hash function
differently and get more hash bits. This is not urgent and I believe
we should just implement salts when and if some algorithm needs them.

Regards,
    Jeff Davis

[1] You can a kind of mirroring in the hash outputs indicating bad mixing:

postgres=# select hashint8((2^32-100)::int8);
 hashint8
----------
 41320869
(1 row)

postgres=# select hashint8((-2^32-100)::int8);
 hashint8
----------
 41320869
(1 row)

postgres=# select hashint8((2^32-101)::int8);
  hashint8
-------------
 -1214482326
(1 row)

postgres=# select hashint8((-2^32-101)::int8);
  hashint8
-------------
 -1214482326
(1 row)


[2] Not sure I'm using the term "salt" properly. I really just mean a
way to affect the initial state, which I think is good enough for our
purposes.


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Hash Functions

Reply via email to