On 1 August 2016 at 08:19, Greg Stark <st...@mit.edu> wrote:
> Surely combining multiple hashes is the same problem as hashing a block of
> memory? Shouldn't we just use the same algorithm as hash_any()?
Yes, I imagine that should work well. I suspect that Thomas's proposal would probably also work well, as would a number of other hashing algorithms with a reasonable pedigree, such as the one used for array hashing. I don't have any particular preference, but I do know that what usually turns out to be disastrous is an arbitrary made-up formula of rotates and XORs. The last thing we should attempt to do is invent our own hashing algorithm.

On that subject, while looking at hashfunc.c, I spotted that hashint8() has a very obvious deficiency, which causes disastrous performance with certain inputs:

create table foo (a bigint);
insert into foo select (x*2^32)+x from generate_series(1,1000000) g(x);
analyse foo;

select * from foo f1, foo f2 where f1.a=f2.a;

With random values, that query (using a hash join) takes around 2 seconds on my machine, but with the values above I estimate it would take around 20 hours (although I didn't wait that long!). The reason is pretty obvious: all of the bigint values above hash to the same value, so every probe lands in the same hash bucket and the join degenerates to quadratic behaviour.

I'd suggest using hash_uint32() for values that fit in a 32-bit integer, and hash_any() otherwise.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers