Re: IntHashSet, SentinelIntSet, SortedIntDocSet, ...

Dawid Weiss Thu, 31 Jan 2013 00:03:30 -0800

> Thanks for that advise; I trust your judgement!

I'm not being judgmental here, I speak from experience. I also though
it didn't make sense to rehash int keys when I was writing HPPC but
life quickly taught me the opposite. The problem was that hash values
took slots modulo 2^n and, for example, we had keys which were
constructed as:


primaryValue << 32 | secondaryValue

which then conflict for nearly _all_ secondary values, regardless of
the primaryValue. Also, rehashing in my experiments didn't really hurt
performance -- it was all dwarfed by other aspects in which hashmaps
were used (and I still suspect modern CPUs wait for memory accesses
there so you could calculate 10x more in the mean time and nothing
would happen).

> Do you have an opinion on copying IntHashSet directly to Lucene-core or
> some other approach?

Up to you, really. There are a few factors to consider:

1) if you know the distribution of your keys you can use specialized
hashing/ bucketing that will be matched to your data,
2) if you need juc compliance (iterators and such) Sebastiano Vigna's
fastutil has a nice set of interfaces that allows both boxed value
iterators and primitive iterators (without the need of going to boxed
types). fastutil's implementations are nearly identical to HPPC's and
the performance is similar (with differences at noise levels I'd say).
3) you need to decide whether to use open hashing or linear hashing.
Both have pros and cons. fastutil and HPPC use linear hashing because
it's more cache friendly -- an older version of both libraries will
have code for double hashing if you want to rip it.

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: IntHashSet, SentinelIntSet, SortedIntDocSet, ...

Reply via email to