For a closed hash table, you want a high probability of uniform distribution
across all cells of the table.  For integers, sequential values have pretty
high probability in the real world which gives really bad spreading with an
identity hash thus leading to anomalously bad performance in a common case.

Using a reasonable permutation of the integers (even just xor with a
suitable value) would change the evil order of insertion to a very obscure
one.  This would keep the perfection of the hash while giving better hash
table performance.  The speed difference is likely to be undetectable.

For an open hash table, on the other hand, spreading values around in the
table is likely to be bad for performance due to memory locality.

Clearly mileage will vary, but for most applications, I would say that the
identity hash is pretty good.

On Mon, Jan 25, 2010 at 10:22 AM, Jake Mannix <jake.man...@gmail.com> wrote:

> I think the reasoning for wanting non-identity hashing comes from thoughts
> like
> this: a good hash function takes small (single bit) changes in the inputs,
> and
> produces typically 16 (32 / 2) bit changes in the output.  The identity
> mapping
> does not come close to satisfying this.
>
> Another, practical way to look at it is that often, the integers you get as
> inputs
> may come from consecutive sequences in a fixed small range (0 to N <<
> MAX_INTEGER),
> and you may, for example, be partitioning your hash outputs not by mod k,
> but
> instead by consecutive ranges (0 to N/k - 1, N/k to 2N/k, ... ).  The
> identity hash
> function would cause major problems here, while a good hash function (as
> defined
> above) would be fine with it.
>
> For the record, I agree with the java standard on this too, in that if
> someone
> really wants to gaurantee "nice" distribution on their hash outputs, they
> can
> do their own.  Having collisionless hashes which are optimally fast is
> best.
>
>  -jake
>
> On Mon, Jan 25, 2010 at 9:07 AM, Sean Owen <sro...@gmail.com> wrote:
>
> > (Out of curiosity what does the distribution have to do with it --
> > what's a distribution for which something besides identity is better?)
> >
> > On Mon, Jan 25, 2010 at 5:05 PM, Dawid Weiss <dawid.we...@gmail.com>
> > wrote:
> > > I was also thinking about this when implementing HPPC. Like I said,
> > > unless you know exactly that you have a weird data distribution,
> >
>



-- 
Ted Dunning, CTO
DeepDyve

Reply via email to