Re: [Jprogramming] Separate chaining hash table

Luke D Wed, 19 Jul 2023 15:19:58 -0700

This makes perfect sense. Again, thank you for a detailed response. A min
load factor was unfortunately an aspect I had not considered initially. I
will now return to the grindstone better off.


Best Regards,
Luke De La Cruz


On Wed, Jul 19, 2023 at 4:10 AM Elijah Stone <elro...@elronnd.net> wrote:

> The load factor of the table, at any given point in time, is the ratio of
> the
> number of keys occupying the table to the number of hash buckets.  The
> maximum
> load factor is a parameter to the table (as matt says, you may not wish to
> expose this parameter in the _interface_, but from the _implementation_
> perspective, it is a parameter): if, following an insert, the load factor
> ever
> exceeds the max load factor, then the table size is increased and the
> table is
> rehashed.  Notionally, if the load factor gets 'too high', then there will
> be
> 'a lot' of keys in each hash bucket, so lookups get 'slow'; quantifying
> 'too
> high' and 'slow' is what gets you the max load factor.  In your table, the
> max
> load factor is 1.
>
> There is another parameter which it is sensible for the table to have: a
> _min_
> load factor.  If, following a delete, the load factor goes _below_ the min
> load factor, then the table size is _decreased_ and it is rehashed.
> Notionally, if the load factor gets 'too low', then there will be 'a lot'
> of
> buckets with no keys, wasting 'a lot' of space.
>
> In particular, storing n keys in the table should require no more than
> O(n)
> space; since your table lacks a min load factor, performing n inserts and
> then
> n deletes uses O(n) space to store no keys, so it has a space leak.
>
> Notionally, clearing the table corresponds to deleting each key, one by
> one;
> so you could figure out, given a min load factor parameter, a current
> size,
> and a resizing policy, how many hash buckets would remain if all the keys
> were
> deleted one by one.  But this would be a bit frivolous and I see no reason
> to
> bother.  Note in particular that, unless the min load factor is zero
> (which
> would be useless), the case when the table is empty is a degenerate one
> (since
> there is no number of hash buckets which can satisfy the min-load-factor
> invariant), so you can pick an arbitrary number (maybe zero, maybe one,
> maybe
> a number of hash buckets which would be acceptable if the table contained
> only
> one key).  (Maintaining as an invariant that there is always at least one
> hash
> bucket, even when the table contains no occupants, simplifies the table
> and
> avoids special cases; my not mentioning this and suggesting to clear to
> zero
> buckets was an oversight on my part.)
>
> Note also that allowing the user to specify the number of hash buckets the
> table initially contains can also violate the min load factor invariant.
> This is no accident.  If the user specifies an overlarge number, and then
> never inserts many keys into the table, that's a space leaks too.  And
> it's
> easy to misestimate how many keys you plan to insert into the table.  If
> you
> pick a small number for them, and they end up inserting a large number of
> keys
> anyway, then it's not a big deal--you only take a constant (amortised)
> overhead for that.  Hence, allowing the user to set the initial table size
> is
> a bad idea for the same reason as allowing them to set the max (and min)
> load
> factors is.
>
> (You might have noticed a curious interaction between min and max load
> factor.
> The min load factor, max load factor, and resize policy need to be chosen
> with
> awareness of each other, so you don't end up in a situation where you
> insert a
> key, the table rehashes, remove a key, the table rehashes again, etc.)
>
> On Tue, 18 Jul 2023, Luke D wrote:
>
> > Wow, I appreciate your thorough examination.
> >
> > Nearly all of the shortcomings came from me following the simple
> interface
> > and behavior laid out in chapter 5 of Mark Weiss' "Data Structures and
> > Algorithm Analysis in C++" (I wanted to ensure that I had something
> working
> > as intended), such as using horners to begin with and reducing mod 2^32
> > (the book uses unsigned ints, though you are right about moving the mod
> > inside the reduction) and reversing the input string to mimic a for each
> > loop.
> >
> >> I guess the only reason for tacking on the zero is to handle zero-length
> > strings?
> > Frankly, I hadn't even considered empty strings, but rather, I was using
> 0
> > as a logical start value.
> >
> >> One alternative is to use the recently-added fold family of the
> > conjunctions
> > I have actually never used these before, but it appears that the
> equivalent
> > of the C++ code, found in Figure 5.4 of the aforementioned book and
> below,
> > would be (0 [F..((2 ^ 32x)|[+37x*]) 3 u: ]), which is much less
> > intimidating than I had originally thought and handles empty strings as
> you
> > mentioned.
> >
> > unsigned int horners(const std::string& key, int tableSize) {
> >    unsigned int hashValue = 0;
> >    for (char c : key) hashValue = 37 * hashValue + c;
> >    return hashValue % s;
> > }
> >
> >> clear =: {{ 1 [ table =: a: #~ ts }
> > I should definitely set occupancy to 0 here regardless, but I am not sure
> > about setting the table size to 0 (reusing the initial table size to
> > possibly free up memory seems better). The goal of this method is to
> > simply empty all of the items inserted into the table, and my current,
> > naive understanding is that you would change how many buckets you have
> > when it's time to rehash, which is the situation Matt is describing in
> his
> > talk with the set max load factor method. Could you elaborate on this?
> >
> > Best Regards,
> > Luke De La Cruz
> >
> >
> > On Tue, Jul 18, 2023 at 5:13 AM Elijah Stone <elro...@elronnd.net>
> wrote:
> >
> >> Cool.
> >>
> >> > if. contains y do. 0 return. end.
> >>
> >> I think the more usual hash table interface, when asked to insert a key
> >> that
> >> matches an existing one, replaces the existing value with the new one.
> >>
> >> > horners =. (2 ^ 32x) | [: (+37x&*)/ 0 |.@, 3 u: ]
> >>
> >> - Reducing this mod 2^32 at end seems a bit pointless, as you are about
> to
> >> reduce it mod the table size anyway.  Evaluating the intermediate
> >> operations
> >> mod 2^32, to avoid blowup in the intermediate terms, seems more useful.
> >> The
> >> way to do this now is to use the recently-added conjunction m.
> >> (https://code.jsoftware.com/wiki/Vocabulary/mdot); + m. (2x^32) does an
> >> addition mod 2^32, and ditto * .
> >>
> >> - Adding a zero at the beginning and then reversing seems a bit odd--why
> >> not
> >> just add the zero to the end and not reverse?  The result will be
> >> different,
> >> but if the hash is good, it should not change the quality.  (I guess the
> >> only
> >> reason for tacking on the zero is to handle zero-length strings?  It's
> >> annoying that this needs extra effort, and I've specced, but not
> >> implemented,
> >> a solution.  One alternative is to use the recently-added fold family of
> >> the
> >> conjunctions--https://code.jsoftware.com/wiki/Vocabulary/fcap)
> >>
> >> Following notes are for recreational purposes only:
> >>
> >> - I think a more usual approach nowadays to hashing large strings is to
> >> apply
> >> a mac to some fixed-size blocks (a straightforward--though probably
> >> suboptimal--implementation is (-blocksize) mac\ string) and then a more
> >> expensive/expressive polynomial or similar for combining the
> blocks--this
> >> should be faster and better quality than djb, though probably more
> effort.
> >>
> >> - You can use _3&ic--after padding to a multiple of 8 bytes--to get the
> >> characters packed more densely--eight per element, rather than one.  If
> >> using
> >> a better hash (i.e., something other than the likes of djb or fna-1a),
> >> that
> >> will let you get through it quicker.
> >>
> >> > NB. maybe there's a better int hash?
> >>
> >> Perhaps an appelby-style mixer--e.g.
> >> http://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html
> >>
> >> It's a good idea to apply such a mixer to the result of the string hash,
> >> too.
> >>
> >> > ts&|`horners@.('literal' -: datatype) y
> >> > index =: {{ ts | hash y }}
> >>
> >> When y is an integer, you reduce it mod x in hash, and then immediately
> >> again
> >> in index.
> >>
> >> > clear =: {{ 1 [ table =: a: #~ ts }}
> >>
> >> I think this is prone to space leaks.  Shouldn't it reset the table size
> >> to
> >> zero?  You won't be able to reuse the storage from the table anyway.
> >> (Well,
> >> you could, but you don't.)  Alternately, you could save the size passed
> in
> >> when the table was created and use that instead.  But empirically, it
> >> seems
> >> like a bad idea to let users control such aspects of the hash table--as
> >> the
> >> designer of google's hash table says:
> https://youtu.be/ncHmEUmJZf4?t=3075.
> >>
> >> > cs =: 0 NB. current size
> >>
> >> I think this is usually referred to as 'occupancy' (as in 'how many
> >> occupants
> >> does the table have?').
> >>
> >> > 1 [ insert every items
> >>
> >> It's common to return EMPTY (which is the same as i.0 0) from a verb
> which
> >> is
> >> called only for effect (rather than for its result).  As a matrix with
> >> zero
> >> rows, it takes up zero lines of precious screenspace at the repl--a pun,
> >> but a
> >> cute one.
> >>
> >> On Tue, 18 Jul 2023, Luke D wrote:
> >>
> >> > I have written an implementation for a separate chaining hash table
> that
> >> > handles integer and string keys, which I have posted as a gist
> >> > https://gist.github.com/4hg/f412476edba86599f1b0e6dd2ea8b12d, and I
> am
> >> > moderately pleased with the outcome. It was my first exploration with
> the
> >> > topics found in chapter 25 of Learning J.
> >> >
> >> > If anyone cares to comment, I am eager to hear any feedback regarding
> >> > convention or design. Feel free to tear it up, if necessary, however,
> I
> >> > still have some testing to do, so apologies if I am showcasing a limp
> >> horse.
> >> >
> >> > Best Regards,
> >> > Luke De La Cruz
> >> > ----------------------------------------------------------------------
> >> > For information about J forums see
> http://www.jsoftware.com/forums.htm
> >> ----------------------------------------------------------------------
> >> For information about J forums see http://www.jsoftware.com/forums.htm
> >>
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] Separate chaining hash table

Reply via email to