Re: [HACKERS] Hash support for arrays

2010-11-04 Thread Dean Rasheed
On 3 November 2010 09:24, Nicolas Barbier nicolas.barb...@gmail.com wrote:
 2010/11/2 Kenneth Marshall k...@rice.edu:

 Given that our hash implementation mixes the input data well (It does.
 I tested it.) then a simple rotate-and-xor method is all that should
 be needed to maintain all of the needed information. The original
 hash function has done the heavy lifting in this case.

 Even with the perfect hash function for the elements, certain
 combinations of elements could still lead to massive collisions. E.g.,
 if repeated values are typical in the input data we are talking about,
 then the rotate-and-xor method would still lead to collisions between
 any array of the same values of certain lengths, regardless of the
 value. In Tom's implementation, as he mentioned before, those
 problematical lengths would be multiples of 32 (e.g., an array of 32
 1s would collide with an array of 32 2s would collide with an array of
 32 3s, etc).


Yeah, rotate-and-xor is a pretty weak hashing algorithm, since any
array of 32 identical elements will hash to either 0 or -1. Similarly,
various permutations or multiples of that array length will cause it
to perform badly.
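
To make that concrete, here's a quick standalone illustration (my own
throwaway C, not from the patch): any 32 identical element hashes
collapse to all-zeros or all-ones, depending only on their parity.

#include <stdio.h>
#include <stdint.h>

/* the proposed combining step: rotate accumulator left one bit, xor */
static uint32_t rotl1(uint32_t x)
{
    return (x << 1) | (x >> 31);
}

static uint32_t combine(uint32_t elem_hash, int nelems)
{
    uint32_t acc = 0;
    int i;

    for (i = 0; i < nelems; i++)
        acc = rotl1(acc) ^ elem_hash;
    return acc;
}

int main(void)
{
    /* 32 identical elements: result depends only on the parity of
     * the element's hash value */
    printf("%08x\n", (unsigned) combine(0xdeadbeefu, 32)); /* even parity -> 00000000 */
    printf("%08x\n", (unsigned) combine(0x00000001u, 32)); /* odd parity  -> ffffffff */
    return 0;
}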

The multiply-by-m algorithm doesn't have that weakness, provided m is
chosen carefully. There are a couple of qualities a good algorithm
should possess:

1). The bits from the individual element hash values should be
distributed evenly so that no 2 different hash values would result in
the same contribution to the final value. This is easy to achieve -
just make sure that m is odd.

2). The way that each element's hash value bits are distributed should
be different from the way that every other element's hash value bits
are distributed. m=31 achieves this pretty well, although there are
plenty of other equally valid choices.
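
For reference, the multiply-by-m combiner I'm describing is just this
(a sketch with m = 31; the element hashes are assumed to be computed
already, and the names are mine):

#include <stdint.h>

/* multiply-by-m combining of element hash values (m = 31, odd) */
uint32_t
combine_mult31(const uint32_t *hashes, int n)
{
    uint32_t acc = 0;
    int i;

    for (i = 0; i < n; i++)
        acc = acc * 31 + hashes[i];
    return acc;
}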

Regards,
Dean



Re: [HACKERS] Hash support for arrays

2010-11-04 Thread Kenneth Marshall
On Thu, Nov 04, 2010 at 10:00:40AM +0000, Dean Rasheed wrote:
 On 3 November 2010 09:24, Nicolas Barbier nicolas.barb...@gmail.com wrote:
  2010/11/2 Kenneth Marshall k...@rice.edu:
 
  Given that our hash implementation mixes the input data well (It does.
  I tested it.) then a simple rotate-and-xor method is all that should
  be needed to maintain all of the needed information. The original
  hash function has done the heavy lifting in this case.
 
  Even with the perfect hash function for the elements, certain
  combinations of elements could still lead to massive collisions. E.g.,
  if repeated values are typical in the input data we are talking about,
  then the rotate-and-xor method would still lead to collisions between
  any array of the same values of certain lengths, regardless of the
  value. In Tom's implementation, as he mentioned before, those
  problematical lengths would be multiples of 32 (e.g., an array of 32
  1s would collide with an array of 32 2s would collide with an array of
  32 3s, etc).
 
 
 Yeah, rotate-and-xor is a pretty weak hashing algorithm, since any
 array of 32 identical elements will hash to either 0 or -1. Similarly,
 various permutations or multiples of that array length will cause it
 to perform badly.
 
 The multiply-by-m algorithm doesn't have that weakness, provided m is
 chosen carefully. There are a couple of qualities a good algorithm
 should possess:
 
 1). The bits from the individual element hash values should be
 distributed evenly so that no 2 different hash values would result in
 the same contribution to the final value. This is easy to achieve -
 just make sure that m is odd.
 
 2). The way that each element's hash value bits are distributed should
 be different from the way that every other element's hash value bits
 are distributed. m=31 achieves this pretty well, although there are
 plenty of other equally valid choices.
 
 Regards,
 Dean
 
Hi Dean,

In my comment yesterday, I included a simple function that would
allow us to leverage our current hash function's mixing process to
scramble the bits effectively and retain the maximum amount of
information in the hash.

Regards,
Ken



Re: [HACKERS] Hash support for arrays

2010-11-03 Thread Nicolas Barbier
2010/11/2 Kenneth Marshall k...@rice.edu:

 Given that our hash implementation mixes the input data well (It does.
 I tested it.) then a simple rotate-and-xor method is all that should
 be needed to maintain all of the needed information. The original
 hash function has done the heavy lifting in this case.

Even with the perfect hash function for the elements, certain
combinations of elements could still lead to massive collisions. E.g.,
if repeated values are typical in the input data we are talking about,
then the rotate-and-xor method would still lead to collisions between
any array of the same values of certain lengths, regardless of the
value. In Tom's implementation, as he mentioned before, those
problematical lengths would be multiples of 32 (e.g., an array of 32
1s would collide with an array of 32 2s would collide with an array of
32 3s, etc).

Nicolas



Re: [HACKERS] Hash support for arrays

2010-11-03 Thread Kenneth Marshall
On Wed, Nov 03, 2010 at 10:24:16AM +0100, Nicolas Barbier wrote:
 2010/11/2 Kenneth Marshall k...@rice.edu:
 
  Given that our hash implementation mixes the input data well (It does.
  I tested it.) then a simple rotate-and-xor method is all that should
  be needed to maintain all of the needed information. The original
  hash function has done the heavy lifting in this case.
 
 Even with the perfect hash function for the elements, certain
 combinations of elements could still lead to massive collisions. E.g.,
 if repeated values are typical in the input data we are talking about,
 then the rotate-and-xor method would still lead to collisions between
 any array of the same values of certain lengths, regardless of the
 value. In Tom's implementation, as he mentioned before, those
 problematical lengths would be multiples of 32 (e.g., an array of 32
 1s would collide with an array of 32 2s would collide with an array of
 32 3s, etc).
 
 Nicolas
 

True. I just took another look at our defined hash functions and it
looks like we can make a simple variant of hash_uint32() that we
can use as a stream checksum. The only thing missing is the ability
to pass in the current 32-bit hash value as a starting seed when adding
the next 32-bit value. Something like this should work:

Datum
hash_uint32(uint32 k, uint32 initval)
{
    register uint32 a,
                    b,
                    c;

    a = b = c = 0x9e3779b9 + (uint32) sizeof(uint32) + 3923095 + initval;
    a += k;

    final(a, b, c);

    /* report the result */
    return UInt32GetDatum(c);
}

Then if you pass in the current value as the initval, it should mix
in each additional 32-bit hash value well.
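
That is, the array-hashing loop would chain it along these lines (a
sketch only; elem_hashes[] is just a stand-in for the per-element
hash values):

uint32 running = 0;
int    i;

for (i = 0; i < nelems; i++)
    running = DatumGetUInt32(hash_uint32(elem_hashes[i], running));
/* 'running' now holds the combined hash for the whole array */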

Regards,
Ken



Re: [HACKERS] Hash support for arrays

2010-11-02 Thread Robert Haas
On Sat, Oct 30, 2010 at 10:01 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 marcin mank marcin.m...@gmail.com writes:
 On Sat, Oct 30, 2010 at 6:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 3. To hash, apply the element type's hash function to each array
 element.  Combine these values by rotating the accumulator left
 one bit at each step and xor'ing in the next element's hash value.

 Thoughts?  In particular, is anyone aware of a better method
 for combining the element hash values?

 This would make the hash the same for arrays with elements 32 apart swapped.

 Well, there are *always* going to be cases where you get the same hash
 value for two different inputs; it's unavoidable given that you have to
 combine N 32-bit hash values into only one 32-bit output.

Sure.  The goal is to make those hard to predict, though.  I think
multiply by 31 and add the next value is a fairly standard way of
getting that behavior.  It mixes up the bits a lot more than just
left-shifting by a variable offset.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash support for arrays

2010-11-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sat, Oct 30, 2010 at 10:01 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 marcin mank marcin.m...@gmail.com writes:
 This would make the hash the same for arrays with elements 32 apart swapped.
 
 Well, there are *always* going to be cases where you get the same hash
 value for two different inputs; it's unavoidable given that you have to
 combine N 32-bit hash values into only one 32-bit output.

 Sure.  The goal is to make those hard to predict, though.

Really?  I think "I don't understand when this fails" isn't obviously
better than "being able to predict when it fails" ...

 I think
 multiply by 31 and add the next value is a fairly standard way of
 getting that behavior.  It mixes up the bits a lot more than just
 left-shifting by a variable offset.

What concerns me about that is that it tends to push the bits to the
left --- I think the effects of the earlier inputs are going to overflow
out as you incorporate a lot of newer inputs.  What you want is a scheme
where every input item affects all the bits of the final result about
equally, and my gut feeling is this doesn't provide that.

I'd be happier about this approach if there were some actual theory
behind it ... maybe there's some analysis out there, but the one link
that was given was just about entirely unconvincing.

regards, tom lane



Re: [HACKERS] Hash support for arrays

2010-11-02 Thread Alvaro Herrera
Excerpts from Tom Lane's message of Tue Nov 02 15:21:31 -0300 2010:

 What concerns me about that is that it tends to push the bits to the
 left --- I think the effects of the earlier inputs are going to overflow
 out as you incorporate a lot of newer inputs.  What you want is a scheme
 where every input item affects all the bits of the final result about
 equally, and my gut feeling is this doesn't provide that.

Ahh, now I understand what you meant by "rotating" the bits in the
original email in this thread.  You're not simply shifting.  Clever.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Hash support for arrays

2010-11-02 Thread Robert Haas
On Tue, Nov 2, 2010 at 11:21 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Sat, Oct 30, 2010 at 10:01 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 marcin mank marcin.m...@gmail.com writes:
 This would make the hash the same for arrays with elements 32 apart 
 swapped.

 Well, there are *always* going to be cases where you get the same hash
 value for two different inputs; it's unavoidable given that you have to
 combine N 32-bit hash values into only one 32-bit output.

 Sure.  The goal is to make those hard to predict, though.

 Really?  I think "I don't understand when this fails" isn't obviously
 better than "being able to predict when it fails" ...

Isn't that the whole point of hash functions?  Collisions are
inevitable, but you want them to be unpredictable.  If you want a hash
function with predictable collision behavior, just hash the first
element.  It'll be highly predictable AND wicked fast.

 I think
 multiply by 31 and add the next value is a fairly standard way of
 getting that behavior.  It mixes up the bits a lot more than just
 left-shifting by a variable offset.

 What concerns me about that is that it tends to push the bits to the
 left --- I think the effects of the earlier inputs are going to overflow
 out as you incorporate a lot of newer inputs.  What you want is a scheme
 where every input item affects all the bits of the final result about
 equally, and my gut feeling is this doesn't provide that.

I don't think so.  That would be a problem if you multiplied by an
even number.  You can see that you don't get dups if you just add in
the same value over and over:

#include <stdio.h>

int
main(int argc, char **argv)
{
    unsigned hv = 1;
    unsigned i;

    for (i = 0; i < 100; ++i)
    {
        hv = hv * 31 + 1;
        printf("%08x\n", hv);
    }
    return 0;
}

No dups.  Also, if you change that first hv = 1 line to hv = 2 and
rerun, that sequence also has no dups, and no collisions with the hv =
1 sequence.

 I'd be happier about this approach if there were some actual theory
 behind it ... maybe there's some analysis out there, but the one link
 that was given was just about entirely unconvincing.

I think it's from Knuth, though unfortunately I don't have a copy to
check.  There are probably better algorithms out there, but this one's
pretty simple.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash support for arrays

2010-11-02 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 
 The goal is to make those hard to predict, though.

 Really?  I think "I don't understand when this fails" isn't
 obviously better than "being able to predict when it fails" ...
 
 Isn't that the whole point of hash functions?  Collisions are
 inevitable, but you want them to be unpredictable.
 
Are you sure you're not confusing attributes of a good random number
generator with hash generation?  (They have much in common, but I've
only seen concern with "hard to predict" when working on random
number algorithms, as for financial audits or jury selection.)
 
 If you want a hash function with predictable collision behavior,
 just hash the first element.  It'll be highly predictable AND
 wicked fast.
 
You're confusing "unpredictable" with "widely distributed", I think.
There's no reason that the hash value of an integer of the same size
as the hash shouldn't be the integer itself, for example.  It's hard
to get more predictable than that, yet it works well for hash
lookups.
 
I know that for a hash of a character string, the "total = (total *
31) + nextcharacter" method has been shown to be effective.  That's hardly
random or hard to predict, but it does tend to spread out the hash
values well.  Whether it is as good for arrays, I don't know.  It
seems like a reasonable place to start, though, since a character
string is rather like an array of characters.
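
(That is, the familiar loop -- just a sketch, names mine:)

/* classic string hash: h = h*31 + next character */
unsigned
string_hash31(const char *s)
{
    unsigned h = 0;

    while (*s)
        h = h * 31 + (unsigned char) *s++;
    return h;
}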
 
 What concerns me about that is that it tends to push the bits to
 the left --- I think the effects of the earlier inputs are going
 to overflow out as you incorporate a lot of newer inputs.  What
 you want is a scheme where every input item affects all the bits
 of the final result about equally, and my gut feeling is this
 doesn't provide that.
 
 I don't think so.  That would be a problem if you multiplied by an
 even number.  You can see that you don't get dups if you just add
 in the same value over and over:
 
Yeah, that 1 in the low order position ensures that the impact of
any value which is ever added into the total is never entirely lost.
 
-Kevin



Re: [HACKERS] Hash support for arrays

2010-11-02 Thread Robert Haas
On Tue, Nov 2, 2010 at 12:34 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:

 The goal is to make those hard to predict, though.

 Really?  I think "I don't understand when this fails" isn't
 obviously better than "being able to predict when it fails" ...

 Isn't that the whole point of hash functions?  Collisions are
 inevitable, but you want them to be unpredictable.

 Are you sure you're not confusing attributes of a good random number
 generator with hash generation?  (They have much in common, but I've
 only seen concern with "hard to predict" when working on random
 number algorithms, as for financial audits or jury selection.)

 If you want a hash function with predictable collision behavior,
 just hash the first element.  It'll be highly predictable AND
 wicked fast.

 You're confusing "unpredictable" with "widely distributed", I think.
 There's no reason that the hash value of an integer of the same size
 as the hash shouldn't be the integer itself, for example.  It's hard
 to get more predictable than that, yet it works well for hash
 lookups.

Well, no, not really.  For example, it may be that you have a hash
table with N buckets and values of the form N+k, 2N+k, 3N+k, ...,
and everything will collide.

If you do some mixing of the bits, that case is far less likely to
occur in practice.

See for example http://www.azillionmonkeys.com/qed/hash.html
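
A trivial illustration of the difference (my own example; the mixer
here is MurmurHash3's integer finalizer, just one possibility, not the
one from that link):

#include <stdio.h>
#include <stdint.h>

/* full-avalanche integer mixer (MurmurHash3's finalizer) */
static uint32_t mix(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x85ebca6bu;
    x ^= x >> 13;
    x *= 0xc2b2ae35u;
    x ^= x >> 16;
    return x;
}

int main(void)
{
    unsigned N = 1024, k = 7, i;

    /* keys spaced N apart: identity hashing piles them all into
     * bucket k, while mixing scatters them */
    for (i = 1; i <= 4; i++)
        printf("key %u -> identity bucket %u, mixed bucket %u\n",
               i * N + k, (i * N + k) % N,
               (unsigned) (mix(i * N + k) % N));
    return 0;
}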

 I know that for a hash of a character string, the "total = (total *
 31) + nextcharacter" method has been shown to be effective.  That's hardly
 random or hard to predict, but it does tend to spread out the hash
 values well.  Whether it is as good for arrays, I don't know.  It
 seems like a reasonable place to start, though, since a character
 string is rather like an array of characters.

That was my thought.

 What concerns me about that is that it tends to push the bits to
 the left --- I think the effects of the earlier inputs are going
 to overflow out as you incorporate a lot of newer inputs.  What
 you want is a scheme where every input item affects all the bits
 of the final result about equally, and my gut feeling is this
 doesn't provide that.

 I don't think so.  That would be a problem if you multiplied by an
 even number.  You can see that you don't get dups if you just add
 in the same value over and over:

 Yeah, that 1 in the low order position ensures that the impact of
 any value which is ever added into the total is never entirely lost.

Right...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash support for arrays

2010-11-02 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 There's no reason that the hash value of an integer of the same
 size as the hash shouldn't be the integer itself, for example. 
 It's hard to get more predictable than that, yet it works well
 for hash lookups.
 
 Well, no, not really.  For example, it may be that you have a hash
 table with N buckets and values of the form N+k, 2N+k, 3N+k, ...,
 and everything will collide.
 
That hardly seems convincing if the integer is a sequentially
assigned number, as with many ID columns; but if you want an
algorithm which will work well with numbers which might be assigned
in a pattern with a high correlation with some unpredictable number
of buckets -- sure, it won't work well without some scrambling.  For
sequentially assigned numbers, that scrambling would have a cost and
would be of no value.  I guess it depends a bit on your use case and
goals.
 
-Kevin



Re: [HACKERS] Hash support for arrays

2010-11-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Nov 2, 2010 at 11:21 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Really?  I think "I don't understand when this fails" isn't obviously
 better than "being able to predict when it fails" ...

 Isn't that the whole point of hash functions?  Collisions are
 inevitable, but you want them to be unpredictable.  If you want a hash
 function with predictable collision behavior, just hash the first
 element.  It'll be highly predictable AND wicked fast.

That seems like a rather poor straw man, since it suffers from exactly
the defect I'm complaining about, namely failing to consider all the
input values equally.

 I'd be happier about this approach if there were some actual theory
 behind it ... maybe there's some analysis out there, but the one link
 that was given was just about entirely unconvincing.

 I think it's from Knuth, though unfortunately I don't have a copy to
 check.  There are probably better algorithms out there, but this one's
 pretty simple.

I don't see anything in Knuth suggesting a multiplier of 31.  His
recommendation for a multiplier, if you're going to use multiplicative
hashing, is wordsize/phi (phi being the golden ratio) ... and he also
wants you to keep the high order not the low order bits of the product.
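
(In code, that recommendation is roughly the following -- a sketch
only, using 2654435761 as an approximation of 2^32/phi:)

#include <stdint.h>

/* Knuth multiplicative hashing: multiply by ~2^32/phi and keep the
 * HIGH-order p bits as the bucket number */
static uint32_t
knuth_mult_hash(uint32_t key, int p)
{
    return (key * 2654435761u) >> (32 - p);
}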

However, this is largely beside the point, because that theory, as well
as the Java code you're arguing from, has to do with the initial hashing
of a raw sequence of input items.  Not with combining some existing hash
values.  The rotate-and-xor method I suggested for that is borrowed
exactly from section 6.4 of Knuth (page 512, in the first edition of
volume 3).

regards, tom lane



Re: [HACKERS] Hash support for arrays

2010-11-02 Thread Kenneth Marshall
On Tue, Nov 02, 2010 at 04:42:19PM -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Tue, Nov 2, 2010 at 11:21 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Really?  I think "I don't understand when this fails" isn't obviously
  better than "being able to predict when it fails" ...
 
  Isn't that the whole point of hash functions?  Collisions are
  inevitable, but you want them to be unpredictable.  If you want a hash
  function with predictable collision behavior, just hash the first
  element.  It'll be highly predictable AND wicked fast.
 
 That seems like a rather poor straw man, since it suffers from exactly
 the defect I'm complaining about, namely failing to consider all the
 input values equally.
 
  I'd be happier about this approach if there were some actual theory
  behind it ... maybe there's some analysis out there, but the one link
  that was given was just about entirely unconvincing.
 
  I think it's from Knuth, though unfortunately I don't have a copy to
  check.  There are probably better algorithms out there, but this one's
  pretty simple.
 
 I don't see anything in Knuth suggesting a multiplier of 31.  His
 recommendation for a multiplier, if you're going to use multiplicative
 hashing, is wordsize/phi (phi being the golden ratio) ... and he also
 wants you to keep the high order not the low order bits of the product.
 
 However, this is largely beside the point, because that theory, as well
 as the Java code you're arguing from, has to do with the initial hashing
 of a raw sequence of input items.  Not with combining some existing hash
 values.  The rotate-and-xor method I suggested for that is borrowed
 exactly from section 6.4 of Knuth (page 512, in the first edition of
 volume 3).
 
   regards, tom lane
 

Given that our hash implementation mixes the input data well (It does.
I tested it.) then a simple rotate-and-xor method is all that should
be needed to maintain all of the needed information. The original
hash function has done the heavy lifting in this case.

Regards,
Ken



Re: [HACKERS] Hash support for arrays

2010-11-02 Thread Robert Haas
On Nov 2, 2010, at 1:42 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 However, this is largely beside the point, because that theory, as well
 as the Java code you're arguing from, has to do with the initial hashing
 of a raw sequence of input items.  Not with combining some existing hash
 values.  The rotate-and-xor method I suggested for that is borrowed
 exactly from section 6.4 of Knuth (page 512, in the first edition of
 volume 3).

It seems undesirable to me to have a situation where transposing two array 
elements doesn't change the hash value.  But I just work here.

...Robert


Re: [HACKERS] Hash support for arrays

2010-11-02 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Nov 2, 2010, at 1:42 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 However, this is largely beside the point, because that theory, as well
 as the Java code you're arguing from, has to do with the initial hashing
 of a raw sequence of input items.  Not with combining some existing hash
 values.  The rotate-and-xor method I suggested for that is borrowed
 exactly from section 6.4 of Knuth (page 512, in the first edition of
 volume 3).

 It seems undesirable to me to have a situation where transposing two array 
 elements doesn't change the hash value.  But I just work here.

[ shrug... ]  There are always going to be collisions, and you're
overstating the importance of this one (only some transpositions will
result in a duplicate hash, not any transposition).

What's actually *important*, for our purposes, is that all bits of the
final hash value depend in a reasonably uniform way on all bits of the
input hash values.  If we don't have that property, then bucket numbers
(which we form by taking the low-order N bits of the final hash, where
N isn't known in advance) won't be as well distributed as we'd like.

It's possible that the multiply-by-31 method is as good as the
rotate-and-xor method by that measure, or even better; but it's far from
obvious that it's better.  And I'm not convinced that the multiply
method has a pedigree that should encourage me to take it on faith.

regards, tom lane



Re: [HACKERS] Hash support for arrays

2010-11-02 Thread Sam Mason
On Sat, Oct 30, 2010 at 01:01:44PM -0400, Tom Lane wrote:
 marcin mank marcin.m...@gmail.com writes:
  This is what boost does:
  http://www.systomath.com/include/Boost-1_34/doc/html/boost/hash_combine.html
 
 Hmm.  I am reminded of Knuth's famous dictum: never generate random
 numbers with a method chosen at random.  Is there any actual theory
 behind that algorithm, and if so what is it?  The combination of
 shifting with addition (not xor) seems more likely to lead to weird
 cancellations than any improvement in the hash behavior.

As far as I can tell the boost combiner comes from:

  http://goanna.cs.rmit.edu.au/~jz/fulltext/jasist-tch.pdf

Which seems to be solving a completely different problem and I can't see
how relevant it is to the design of a hash combiner.  The fact that they
get trivial details wrong, like 32-bit integers going up to 2^32 rather
than 2^32-1 doesn't inspire confidence.

One subtle point I noticed on the boost mailing list was that they
didn't want [[1],2] hashing to the same value as [1,2].  I'm pretty sure
that this sort of construction isn't possible to express in PG, but
thought I should mention it just in case.

-- 
  Sam  http://samason.me.uk/



Re: [HACKERS] Hash support for arrays

2010-11-02 Thread Ron Mayer
Tom Lane wrote:
 It's possible that the multiply-by-31 method is as good as the
 rotate-and-xor method by that measure, or even better; but it's far from
 obvious that it's better.  And I'm not convinced that the multiply
 method has a pedigree that should encourage me to take it on faith.

Short summary:
  * some history of it and related trivia follows
  * (nothing here suggests it's better - just old, common, and cheap)


Longer - some trivia regarding its pedigree:

It (or at least a *31 variant) seems to have a history of advocacy
going back to Chris Torek in 1990:
http://groups.google.com/group/comp.lang.c/browse_thread/thread/28c2095282f0c1b5/193be99e9836791b?q=#193be99e9836791b

#define HASH(str, h, p) \
    for (p = str, h = 0; *p;) h = (h << 5) - h + *p++

and it gets referred to in USENIX papers in the early '90s as well:
http://www.usenix.com/publications/library/proceedings/sa92/salz.pdf


Regarding why the magic number 31 [or 33, which also often comes
up]: apparently the only thing magic about it is that it's an
odd number != 1.  The rest of the odd numbers work about as well,
according to this guy who tried to explain it:

http://svn.eu.apache.org/repos/asf/apr/apr/trunk/tables/apr_hash.c
 * The magic of number 33, i.e. why it works better than many other
 * constants, prime or not, has never been adequately explained by
 * anyone. So I try an explanation: if one experimentally tests all
 * multipliers between 1 and 256 (as I did while writing a low-level
 * data structure library some time ago) one detects that even
 * numbers are not useable at all. The remaining 128 odd numbers
 * (except for the number 1) work more or less all equally well.
 * They all distribute in an acceptable way and this way fill a hash
 * table with an average percent of approx. 86%.
 *
 * If one compares the chi^2 values of the variants (see
 * Bob Jenkins ``Hashing Frequently Asked Questions'' at
 * http://burtleburtle.net/bob/hash/hashfaq.html for a description
 * of chi^2), the number 33 not even has the best value. But the
 * number 33 and a few other equally good numbers like 17, 31, 63,
 * 127 and 129 have nevertheless a great advantage to the remaining
 * numbers in the large set of possible multipliers: their multiply
 * operation can be replaced by a faster operation based on just one
 * shift plus either a single addition or subtraction operation. And
 * because a hash function has to both distribute good _and_ has to
 * be very fast to compute, those few numbers should be preferred.
 ...
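
(The shift-based replacements that comment alludes to, concretely:)

h = (h << 5) + h + c;    /* h * 33 + c */
h = (h << 5) - h + c;    /* h * 31 + c */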



Re: [HACKERS] Hash support for arrays

2010-11-01 Thread hernan gonzalez
Hmm.  I am reminded of Knuth's famous dictum: never generate random
numbers with a method chosen at random.  Is there any actual theory
behind that algorithm, and if so what is it?  The combination of
shifting with addition (not xor) seems more likely to lead to weird
cancellations than any improvement in the hash behavior.
For what it's worth: the standard way in Java is multiplying by 31,
which amounts to a 5-bit shift and a subtraction.

In general, a prime number is recommended, though one would wish that
Knuth had made some analysis here... it all seems mostly empirical.

http://computinglife.wordpress.com/2008/11/20/why-do-hash-functions-use-prime-numbers/

Perhaps the 5-bit shift is related to the spread of ASCII code points,
as that hashCode() implementation was mainly tested and implemented
for String.  But now it's also suggested for general objects, and
it's even specified for standard collections; for example:

http://download.oracle.com/javase/6/docs/api/java/util/List.html#hashCode()


Hernán J. González


[HACKERS] Hash support for arrays

2010-10-30 Thread Tom Lane
I was reminded today that I'd promised to do $SUBJECT awhile back.
It's worth having so that hash joins and hash aggregation can work
on array values.  I believe this is a fairly straightforward
exercise:

1. Add a hash opclass that accepts ANYARRAY, similar to the existing
btree opclass for ANYARRAY.  It will work for any array type whose
element type has a hash opclass.

2. Coding is much like the array_cmp support code, including caching
the lookup of the element type's hash function.

3. To hash, apply the element type's hash function to each array
element.  Combine these values by rotating the accumulator left
one bit at each step and xor'ing in the next element's hash value.
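
In code form, the combining loop would be roughly this (pseudo-C
sketch; elem_hash[] stands in for the per-element hash values):

uint32 result = 0;
int    i;

for (i = 0; i < nitems; i++)
{
    /* rotate accumulator left one bit, then xor in the next hash */
    result = (result << 1) | (result >> 31);
    result ^= elem_hash[i];
}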

Thoughts?  In particular, is anyone aware of a better method
for combining the element hash values?

regards, tom lane



Re: [HACKERS] Hash support for arrays

2010-10-30 Thread marcin mank
On Sat, Oct 30, 2010 at 6:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 3. To hash, apply the element type's hash function to each array
 element.  Combine these values by rotating the accumulator left
 one bit at each step and xor'ing in the next element's hash value.

 Thoughts?  In particular, is anyone aware of a better method
 for combining the element hash values?


This would make the hash the same for arrays with elements 32 apart swapped.

This is what boost does:
http://www.systomath.com/include/Boost-1_34/doc/html/boost/hash_combine.html
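
(The combining step on that page boils down to:)

seed ^= hash_value + 0x9e3779b9 + (seed << 6) + (seed >> 2);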

Greetings



Re: [HACKERS] Hash support for arrays

2010-10-30 Thread Tom Lane
marcin mank marcin.m...@gmail.com writes:
 On Sat, Oct 30, 2010 at 6:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 3. To hash, apply the element type's hash function to each array
 element.  Combine these values by rotating the accumulator left
 one bit at each step and xor'ing in the next element's hash value.
 
 Thoughts?  In particular, is anyone aware of a better method
 for combining the element hash values?

 This would make the hash the same for arrays with elements 32 apart swapped.

Well, there are *always* going to be cases where you get the same hash
value for two different inputs; it's unavoidable given that you have to
combine N 32-bit hash values into only one 32-bit output.

 This is what boost does:
 http://www.systomath.com/include/Boost-1_34/doc/html/boost/hash_combine.html

Hmm.  I am reminded of Knuth's famous dictum: never generate random
numbers with a method chosen at random.  Is there any actual theory
behind that algorithm, and if so what is it?  The combination of
shifting with addition (not xor) seems more likely to lead to weird
cancellations than any improvement in the hash behavior.

regards, tom lane



Re: [HACKERS] Hash support for arrays

2010-10-30 Thread Greg Stark
On Sat, Oct 30, 2010 at 9:21 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Thoughts?  In particular, is anyone aware of a better method
 for combining the element hash values?


The obvious thing to do seems like it would be to feed the individual
values back into the regular hash function.

-- 
greg



Re: [HACKERS] Hash support for arrays

2010-10-30 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 On Sat, Oct 30, 2010 at 9:21 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Thoughts?  In particular, is anyone aware of a better method
 for combining the element hash values?

 The obvious thing to do seems like it would be to feed the individual
 values back into the regular hash function.

Hmm, you mean use hash_any on the sequence of hash values?  Interesting
idea; but hash_any doesn't look very amenable to being invoked
incrementally, and I don't think I want to construct a temp array with
enough space for the hashes of all the items in the input array ...
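
(That is, the straightforward version would look something like the
sketch below, which is what I'd rather not do; elem_hashes[] is a
stand-in for the per-element hash values:)

/* buffer all the element hashes, then hash_any() the buffer */
uint32 *tmp = (uint32 *) palloc(nitems * sizeof(uint32));
Datum   result;
int     i;

for (i = 0; i < nitems; i++)
    tmp[i] = elem_hashes[i];
result = hash_any((unsigned char *) tmp, nitems * sizeof(uint32));
pfree(tmp);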

regards, tom lane



Re: [HACKERS] Hash support for arrays

2010-10-30 Thread Greg Stark
On Sat, Oct 30, 2010 at 1:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Hmm, you mean use hash_any on the sequence of hash values?  Interesting
 idea; but hash_any doesn't look very amenable to being invoked
 incrementally, and I don't think I want to construct a temp array with
 enough space for the hashes of all the items in the input array ...

I had rather hoped without evidence that it would be easy enough to
use incrementally. Looking now I see your point; it doesn't lend
itself to that at all.

I suppose you could use CRC, where our interface does allow incremental use.

I'm not really familiar enough with the math to know whether CRC is
any better than your XOR mixing. I suspect they're pretty similar. I
suppose if you have arrays of various lengths containing repetitions
of a single value then the non-CRC would produce a hash of 0 for all
arrays with a length that's a multiple of 32. I'm not sure what CRC
would do.

Oh, one question that occurred to me: you didn't mention including the
bounds of the array or the null bitmap in the hash function. I suppose
it's correct with or without them (or, in the case of the null bitmap,
another way to put it is that the choice of the hash value to represent
null is arbitrary).


-- 
greg



Re: [HACKERS] Hash support for arrays

2010-10-30 Thread Tom Lane
Greg Stark gsst...@mit.edu writes:
 I suppose you could use CRC, where our interface does allow incremental use.

Hm, that's a thought.  I'm not sure though whether CRC really has the
behavior you want from a hash function, in particular that all the bits
are independent (a/k/a taking N low-order bits is a reasonable way to
cut the hash down to a suitable bucket index).  Anybody know?

 I'm not really familiar enough with the math to know whether CRC is
 any better than your XOR mixing. I suspect they're pretty similar. I
 suppose if you have arrays of various lengths containing repetitions
 of a single value then the non-CRC would produce a hash of 0 for all
 arrays with a length that's a multiple of 32. I'm not sure what CRC
 would do.

Some quick experimenting suggests that you get -1 from an array of 32 of
the same value, then 0 from 64 of the same, etc.  This doesn't bother me
particularly though.  There are always going to be some situations that
produce degenerate hash values.

 Oh, one question that occurred to me: you didn't mention including the
 bounds of the array or the null bitmap in the hash function. I suppose
 it's correct with or without them (or, in the case of the null bitmap,
 another way to put it is that the choice of the hash value to represent
 null is arbitrary).

Yeah.  I have it coded to treat a null element as if it had a hash value
of zero.  I don't see a great need to consider the array bounds per se.
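
(Roughly like this in the loop -- a sketch, not the exact code; the
hash-function call and variable names here are schematic:)

if (isnull)
    elem_hash = 0;    /* null elements contribute a fixed zero */
else
    elem_hash = DatumGetUInt32(FunctionCall1(&hash_proc_finfo, elt));
/* elem_hash then feeds the combining step as usual */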

regards, tom lane
