Re: [Jprogramming] Separate chaining hash table

Elijah Stone Wed, 19 Jul 2023 02:10:31 -0700

The load factor of the table, at any given point in time, is the ratio of thenumber of keys occupying the table to the number of hash buckets. The maximumload factor is a parameter to the table (as matt says, you may not wish toexpose this parameter in the _interface_, but from the _implementation_perspective, it is a parameter): if, following an insert, the load factor everexceeds the max load factor, then the table size is increased and the table isrehashed. Notionally, if the load factor gets 'too high', then there will be'a lot' of keys in each hash bucket, so lookups get 'slow'; quantifying 'toohigh' and 'slow' is what gets you the max load factor. In your table, the maxload factor is 1.

There is another parameter which it is sensible for the table to have: a _min_load factor. If, following a delete, the load factor goes _below_ the minload factor, then the table size is _decreased_ and it is rehashed.Notionally, if the load factor gets 'too low', then there will be 'a lot' ofbuckets with no keys, wasting 'a lot' of space.

In particular, storing n keys in the table should require no more than O(n)space; since your table lacks a min load factor, performing n inserts and thenn deletes uses O(n) space to store no keys, so it has a space leak.

Notionally, clearing the table corresponds to deleting each key, one by one;so you could figure out, given a min load factor parameter, a current size,and a resizing policy, how many hash buckets would remain if all the keys weredeleted one by one. But this would be a bit frivolous and I see no reason tobother. Note in particular that, unless the min load factor is zero (whichwould be useless), the case when the table is empty is a degenerate one (sincethere is no number of hash buckets which can satisfy the min-load-factorinvariant), so you can pick an arbitrary number (maybe zero, maybe one, maybea number of hash buckets which would be acceptable if the table contained onlyone key). (Maintaining as an invariant that there is always at least one hashbucket, even when the table contains no occupants, simplifies the table andavoids special cases; my not mentioning this and suggesting to clear to zerobuckets was an oversight on my part.)

Note also that allowing the user to specify the number of hash buckets thetable initially contains can also violate the min load factor invariant.This is no accident. If the user specifies an overlarge number, and thennever inserts many keys into the table, that's a space leaks too. And it'seasy to misestimate how many keys you plan to insert into the table. If youpick a small number for them, and they end up inserting a large number of keysanyway, then it's not a big deal--you only take a constant (amortised)overhead for that. Hence, allowing the user to set the initial table size isa bad idea for the same reason as allowing them to set the max (and min) loadfactors is.

(You might have noticed a curious interaction between min and max load factor.The min load factor, max load factor, and resize policy need to be chosen withawareness of each other, so you don't end up in a situation where you insert akey, the table rehashes, remove a key, the table rehashes again, etc.)


On Tue, 18 Jul 2023, Luke D wrote:

Wow, I appreciate your thorough examination.

Nearly all of the shortcomings came from me following the simple interface
and behavior laid out in chapter 5 of Mark Weiss' "Data Structures and
Algorithm Analysis in C++" (I wanted to ensure that I had something working
as intended), such as using horners to begin with and reducing mod 2^32
(the book uses unsigned ints, though you are right about moving the mod
inside the reduction) and reversing the input string to mimic a for each
loop.

I guess the only reason for tacking on the zero is to handle zero-length

strings?
Frankly, I hadn't even considered empty strings, but rather, I was using 0
as a logical start value.

One alternative is to use the recently-added fold family of the

conjunctions
I have actually never used these before, but it appears that the equivalent
of the C++ code, found in Figure 5.4 of the aforementioned book and below,
would be (0 [F..((2 ^ 32x)|[+37x*]) 3 u: ]), which is much less
intimidating than I had originally thought and handles empty strings as you
mentioned.

unsigned int horners(const std::string& key, int tableSize) {
   unsigned int hashValue = 0;
   for (char c : key) hashValue = 37 * hashValue + c;
   return hashValue % s;
}

clear =: {{ 1 [ table =: a: #~ ts }

I should definitely set occupancy to 0 here regardless, but I am not sure
about setting the table size to 0 (reusing the initial table size to
possibly free up memory seems better). The goal of this method is to
simply empty all of the items inserted into the table, and my current,
naive understanding is that you would change how many buckets you have
when it's time to rehash, which is the situation Matt is describing in his
talk with the set max load factor method. Could you elaborate on this?

Best Regards,
Luke De La Cruz


On Tue, Jul 18, 2023 at 5:13 AM Elijah Stone <[email protected]> wrote:

Cool.

> if. contains y do. 0 return. end.

I think the more usual hash table interface, when asked to insert a key
that
matches an existing one, replaces the existing value with the new one.

> horners =. (2 ^ 32x) | [: (+37x&*)/ 0 |.@, 3 u: ]

- Reducing this mod 2^32 at end seems a bit pointless, as you are about to
reduce it mod the table size anyway.  Evaluating the intermediate
operations
mod 2^32, to avoid blowup in the intermediate terms, seems more useful.
The
way to do this now is to use the recently-added conjunction m.
(https://code.jsoftware.com/wiki/Vocabulary/mdot); + m. (2x^32) does an
addition mod 2^32, and ditto * .

- Adding a zero at the beginning and then reversing seems a bit odd--why
not
just add the zero to the end and not reverse?  The result will be
different,
but if the hash is good, it should not change the quality.  (I guess the
only
reason for tacking on the zero is to handle zero-length strings?  It's
annoying that this needs extra effort, and I've specced, but not
implemented,
a solution.  One alternative is to use the recently-added fold family of
the
conjunctions--https://code.jsoftware.com/wiki/Vocabulary/fcap)

Following notes are for recreational purposes only:

- I think a more usual approach nowadays to hashing large strings is to
apply
a mac to some fixed-size blocks (a straightforward--though probably
suboptimal--implementation is (-blocksize) mac\ string) and then a more
expensive/expressive polynomial or similar for combining the blocks--this
should be faster and better quality than djb, though probably more effort.

- You can use _3&ic--after padding to a multiple of 8 bytes--to get the
characters packed more densely--eight per element, rather than one.  If
using
a better hash (i.e., something other than the likes of djb or fna-1a),
that
will let you get through it quicker.

> NB. maybe there's a better int hash?

Perhaps an appelby-style mixer--e.g.
http://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html

It's a good idea to apply such a mixer to the result of the string hash,
too.

> ts&|`horners@.('literal' -: datatype) y
> index =: {{ ts | hash y }}

When y is an integer, you reduce it mod x in hash, and then immediately
again
in index.

> clear =: {{ 1 [ table =: a: #~ ts }}

I think this is prone to space leaks.  Shouldn't it reset the table size
to
zero?  You won't be able to reuse the storage from the table anyway.
(Well,
you could, but you don't.)  Alternately, you could save the size passed in
when the table was created and use that instead.  But empirically, it
seems
like a bad idea to let users control such aspects of the hash table--as
the
designer of google's hash table says: https://youtu.be/ncHmEUmJZf4?t=3075.

> cs =: 0 NB. current size

I think this is usually referred to as 'occupancy' (as in 'how many
occupants
does the table have?').

> 1 [ insert every items

It's common to return EMPTY (which is the same as i.0 0) from a verb which
is
called only for effect (rather than for its result).  As a matrix with
zero
rows, it takes up zero lines of precious screenspace at the repl--a pun,
but a
cute one.

On Tue, 18 Jul 2023, Luke D wrote:

> I have written an implementation for a separate chaining hash table that
> handles integer and string keys, which I have posted as a gist
> https://gist.github.com/4hg/f412476edba86599f1b0e6dd2ea8b12d, and I am
> moderately pleased with the outcome. It was my first exploration with the
> topics found in chapter 25 of Learning J.
>
> If anyone cares to comment, I am eager to hear any feedback regarding
> convention or design. Feel free to tear it up, if necessary, however, I
> still have some testing to do, so apologies if I am showcasing a limp
horse.
>
> Best Regards,
> Luke De La Cruz
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] Separate chaining hash table

Reply via email to