Patrick Roehl asks:
> The entries are short (4 bytes) and the key is the entire entry. My
> understanding is that in this situation a hash table has no benefit.
> Is this correct?
No, that is not correct. I would select a hash algorithm that yields a
hash value in the range 0 to (a power of 2)-1, such as 255, 511, or
1023, choosing a range large enough that you typically get a "low"
number of collisions (and hence short overflow chains).
You could, for example, start out with a {0-1023} hash range, which
means you need two 4KB pages to hold 1,024 8-byte entries, plus a cell
pool whose cell size is 8 bytes (one word to hold the "key" and the
other word to hold the pointer to the next overflow "key"). I'd expect
that to be good for up to, say, 10,000 unique keys, though at that
volume you'd typically see overflow chains of around 30 to 60 keys.
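As a minimal sketch (in C, just to illustrate the idea), a 4-byte key
can be mapped into the 0-1023 range by multiplying and keeping the top
10 bits; the multiplier below is a well-known multiplicative constant
used purely as a placeholder, not one tuned to your actual keys:

    #include <stdint.h>

    /* Map a 4-byte key into the range 0-1023 by keeping the top 10
     * bits of the product.  The constant is only a placeholder; the
     * "right" multiplier depends on the actual key population.       */
    static unsigned int hash_1k(uint32_t key)
    {
        return (unsigned int)((key * 2654435761u) >> 22);
    }
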
The next increase in the size of the primary hash table (to 16KB) is
not likely to halve the average length of the overflow chains unless
you get lucky and have a hashing algorithm perfectly adapted to the
incoming keys, so an 8KB table might be about the best you could
expect to do. But that's for a number of "keys" presented on the order
of 4,000 to 10,000.
I have been successful most times in the past using this simple rule
of thumb: make the number of BYTES in the hash table the power of 2
closest to (the maximum number of items expected to be hashed,
multiplied by the key length, then divided by 2, 4, or 8), picking
whichever divisor yields a seemingly "right-sized" table. The $64,000
question, of course, is coming up with a hash algorithm that works
well with the actual keys presented and yields a hash in the range 0
to 2**n-1. Sometimes that is easy (the first one you think of is good
enough), and sometimes it takes a little experimentation and
prototyping. But if performance is truly important, that's where you
will get your payback.
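As a worked example of that rule of thumb (the figures are
placeholders: 4-byte keys, at most 10,000 items, a divisor of 4), the
arithmetic comes out to roughly an 8KB table:

    #include <stdio.h>

    /* Round to the nearest power of 2 (a simple scan is fine for sizing). */
    static unsigned long nearest_pow2(unsigned long n)
    {
        unsigned long p = 1;
        while (p * 2 <= n)
            p *= 2;
        return (n - p < 2 * p - n) ? p : 2 * p;   /* p <= n < 2p */
    }

    int main(void)
    {
        unsigned long max_items = 10000;  /* expected maximum number of keys */
        unsigned long key_len   = 4;      /* bytes per key                   */
        unsigned long divisor   = 4;      /* try 2, 4, or 8 and compare      */

        unsigned long bytes = nearest_pow2(max_items * key_len / divisor);
        printf("hash table size: %lu bytes\n", bytes);   /* 8192 here */
        return 0;
    }
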
Hash tables for truly large numbers of items (you say there is the
potential of "a million entries" while the more typical case is just
"a few hundred") can consume large amounts of storage, because a
4-byte pointer to the overflow chain means the minimum hash table
entry size is 4+key_length bytes, unless you put the first hashing key
into the first overflow entry (which is the best option if the keys
are variable length -- which yours are not). The problem in your case
is how to handle a potential million unique 4-byte keys.
Ultimately, in your case, with 4-byte keys, you are going to need at
least one million 8-byte (4-byte pointer + 4-byte key) entries, either
in the hash table itself or in overflow cells, because each unique key
has to be recorded so that, on a hash code collision, you can tell
whether the colliding key is a true duplicate or merely hashes to the
same value. That is, of course, 8 million bytes. If you had a perfect
hashing algorithm (one that produced a unique hash code for each of
your one million keys, every time), then you would need an 8 MB hash
table and a hash code in the range {0, 2**20-1}.
Since no hashing algorithm can guarantee zero collisions for keys you
cannot know in advance, you're still going to have to allocate, for
example, a cell pool (whose element size is 4+4=8 bytes) to handle the
collisions. Ultimately, for a million unique keys (which I assume is
one possible outcome), you will have to allocate a minimum of 8 MB
(and typically a good bit more) of storage for the hash table and
overflow cells. How you distribute that between the hash table and the
overflow cell pool is important: you can size the original hash table
to handle the most typical case (number of unique keys), and then let
the collisions go into the overflow chains in the cell pool, which you
can simply allow to grow as necessary by adding more extents to the
cell pool. You will still end up, in the maximum (1 million unique
keys) case, with more than 8 MB of storage allocated to hash table and
overflow, but sizing the hash table then comes down to knowing the
distribution of the number of unique keys and using that value to
select a base hash table size that balances the following
considerations:
1. minimize the number of overflow cell allocations
2. minimize the amount of storage allocated for the hash table
in light of the fact that you will probably use a single hashing
algorithm in all cases (one geared towards the size of the hash
table that you choose). John Gilmore's last post has some good
suggestions on how to choose a good hashing algorithm. He notes
that you should insert the overflow keys (actually, in his view,
all keys, which is OK, but wastes 4 bytes for the first one) "[in]
ascending algebraic sequence." That is not actually necessary;
you can simply add them as they occur. I see no particular need
to sort them.
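To make that table-plus-overflow-cell arrangement concrete, here is a
minimal C sketch under assumed sizes (1,024 table slots, a fixed array
standing in for a growable cell pool, and a placeholder hash). The
8-byte table entry holds the first key that hashes to a slot; later
colliders go into 8-byte overflow cells, chained in arrival order as
described above:

    #include <stdint.h>
    #include <stdlib.h>

    #define TABLE_SLOTS 1024          /* {0-1023} hash range                */
    #define POOL_CELLS  16384         /* stand-in for a growable cell pool  */
    #define EMPTY (-2)                /* slot has never held a key          */
    #define END   (-1)                /* end of an overflow chain           */

    struct entry { uint32_t key; int32_t next; };   /* 8 bytes              */

    static struct entry table[TABLE_SLOTS];  /* first key lives in the slot */
    static struct entry pool[POOL_CELLS];    /* overflow cells, same shape  */
    static int32_t      next_free;           /* next unused pool cell       */

    static unsigned int hash_1k(uint32_t key)
    {
        return (unsigned int)((key * 2654435761u) >> 22);    /* 0-1023 */
    }

    /* Call before each new set of keys: empty the table, free all cells. */
    static void reset_all(void)
    {
        int j;
        for (j = 0; j < TABLE_SLOTS; j++)
            table[j].next = EMPTY;
        next_free = 0;
    }

    /* Record a key; returns 1 if it is new, 0 if it was seen before. */
    static int record_key(uint32_t key)
    {
        struct entry *e = &table[hash_1k(key)];
        int32_t i;

        if (e->next == EMPTY) {          /* first key to hash to this slot  */
            e->key  = key;
            e->next = END;
            return 1;
        }
        if (e->key == key)
            return 0;                    /* duplicate, found in the table   */

        for (i = e->next; i != END; i = pool[i].next)
            if (pool[i].key == key)
                return 0;                /* duplicate, found in the chain   */

        if (next_free >= POOL_CELLS)
            abort();                     /* a real pool would add an extent */
        i = next_free++;
        pool[i].key  = key;
        pool[i].next = e->next;          /* added as it occurs, unsorted    */
        e->next      = i;
        return 1;
    }

record_key() returns 1 the first time a key is presented and 0 for a
duplicate, so the uniqueness test falls out of the insert itself; a
real cell pool would add another extent rather than abort when it runs
out of cells.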
Another technique you might use if speed (i.e., low CPU overhead) is
critical is to have the primary hash table entries point to another,
smaller hash table (with a different hash algorithm) that hashes only
those keys that collide in the primary hash table. This requires more
storage, but it can yield the most efficient lookup and insert process
(setting aside, of course, any performance cost of the additional
storage required and referenced).
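A hedged sketch of one reading of that two-level idea (the description
above leaves open whether the secondary table is shared or per-slot;
this version uses a single, smaller shared secondary table with a
different placeholder hash, and chains only the keys that collide in
the primary table):

    #include <stdint.h>
    #include <stdlib.h>

    #define PRIMARY_SLOTS   1024
    #define SECONDARY_SLOTS 256              /* smaller table, other hash   */
    #define POOL_CELLS      16384

    struct cell { uint32_t key; int32_t next; };

    static uint32_t    primary_key[PRIMARY_SLOTS];
    static char        primary_used[PRIMARY_SLOTS];  /* 0 = slot is empty   */
    static int32_t     secondary[SECONDARY_SLOTS];   /* chain heads, -1=none */
    static struct cell pool[POOL_CELLS];
    static int32_t     next_free;

    /* Both multipliers are placeholders, not tuned to any real key set. */
    static unsigned int hash1(uint32_t k) { return (k * 2654435761u) >> 22; }
    static unsigned int hash2(uint32_t k) { return (k * 2246822519u) >> 24; }

    static void reset(void)
    {
        int j;
        for (j = 0; j < PRIMARY_SLOTS; j++)   primary_used[j] = 0;
        for (j = 0; j < SECONDARY_SLOTS; j++) secondary[j] = -1;
        next_free = 0;
    }

    /* Returns 1 if the key is new, 0 if it has been seen before. */
    static int record_key(uint32_t key)
    {
        unsigned int p = hash1(key), s;
        int32_t i;

        if (!primary_used[p]) {              /* first key for this slot     */
            primary_key[p]  = key;
            primary_used[p] = 1;
            return 1;
        }
        if (primary_key[p] == key)
            return 0;

        s = hash2(key);                      /* colliders use the 2nd table */
        for (i = secondary[s]; i != -1; i = pool[i].next)
            if (pool[i].key == key)
                return 0;

        if (next_free >= POOL_CELLS)
            abort();
        i = next_free++;
        pool[i].key  = key;
        pool[i].next = secondary[s];
        secondary[s] = i;
        return 1;
    }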
If this is something that is going to happen "frequently", then I
would put the hash table and the cell pool for the overflow/duplicate
keys into a data space. It would then be simple to "zero" the hash
table and reset the cell pool to all-cells-available for the next set
of keys to be evaluated.
--
WB