This makes perfect sense. Again, thank you for a detailed response. A min load factor was unfortunately an aspect I had not considered initially. I will now return to the grindstone better off.
Best Regards, Luke De La Cruz On Wed, Jul 19, 2023 at 4:10 AM Elijah Stone <elro...@elronnd.net> wrote: > The load factor of the table, at any given point in time, is the ratio of > the > number of keys occupying the table to the number of hash buckets. The > maximum > load factor is a parameter to the table (as matt says, you may not wish to > expose this parameter in the _interface_, but from the _implementation_ > perspective, it is a parameter): if, following an insert, the load factor > ever > exceeds the max load factor, then the table size is increased and the > table is > rehashed. Notionally, if the load factor gets 'too high', then there will > be > 'a lot' of keys in each hash bucket, so lookups get 'slow'; quantifying > 'too > high' and 'slow' is what gets you the max load factor. In your table, the > max > load factor is 1. > > There is another parameter which it is sensible for the table to have: a > _min_ > load factor. If, following a delete, the load factor goes _below_ the min > load factor, then the table size is _decreased_ and it is rehashed. > Notionally, if the load factor gets 'too low', then there will be 'a lot' > of > buckets with no keys, wasting 'a lot' of space. > > In particular, storing n keys in the table should require no more than > O(n) > space; since your table lacks a min load factor, performing n inserts and > then > n deletes uses O(n) space to store no keys, so it has a space leak. > > Notionally, clearing the table corresponds to deleting each key, one by > one; > so you could figure out, given a min load factor parameter, a current > size, > and a resizing policy, how many hash buckets would remain if all the keys > were > deleted one by one. But this would be a bit frivolous and I see no reason > to > bother. Note in particular that, unless the min load factor is zero > (which > would be useless), the case when the table is empty is a degenerate one > (since > there is no number of hash buckets which can satisfy the min-load-factor > invariant), so you can pick an arbitrary number (maybe zero, maybe one, > maybe > a number of hash buckets which would be acceptable if the table contained > only > one key). (Maintaining as an invariant that there is always at least one > hash > bucket, even when the table contains no occupants, simplifies the table > and > avoids special cases; my not mentioning this and suggesting to clear to > zero > buckets was an oversight on my part.) > > Note also that allowing the user to specify the number of hash buckets the > table initially contains can also violate the min load factor invariant. > This is no accident. If the user specifies an overlarge number, and then > never inserts many keys into the table, that's a space leaks too. And > it's > easy to misestimate how many keys you plan to insert into the table. If > you > pick a small number for them, and they end up inserting a large number of > keys > anyway, then it's not a big deal--you only take a constant (amortised) > overhead for that. Hence, allowing the user to set the initial table size > is > a bad idea for the same reason as allowing them to set the max (and min) > load > factors is. > > (You might have noticed a curious interaction between min and max load > factor. > The min load factor, max load factor, and resize policy need to be chosen > with > awareness of each other, so you don't end up in a situation where you > insert a > key, the table rehashes, remove a key, the table rehashes again, etc.) > > On Tue, 18 Jul 2023, Luke D wrote: > > > Wow, I appreciate your thorough examination. > > > > Nearly all of the shortcomings came from me following the simple > interface > > and behavior laid out in chapter 5 of Mark Weiss' "Data Structures and > > Algorithm Analysis in C++" (I wanted to ensure that I had something > working > > as intended), such as using horners to begin with and reducing mod 2^32 > > (the book uses unsigned ints, though you are right about moving the mod > > inside the reduction) and reversing the input string to mimic a for each > > loop. > > > >> I guess the only reason for tacking on the zero is to handle zero-length > > strings? > > Frankly, I hadn't even considered empty strings, but rather, I was using > 0 > > as a logical start value. > > > >> One alternative is to use the recently-added fold family of the > > conjunctions > > I have actually never used these before, but it appears that the > equivalent > > of the C++ code, found in Figure 5.4 of the aforementioned book and > below, > > would be (0 [F..((2 ^ 32x)|[+37x*]) 3 u: ]), which is much less > > intimidating than I had originally thought and handles empty strings as > you > > mentioned. > > > > unsigned int horners(const std::string& key, int tableSize) { > > unsigned int hashValue = 0; > > for (char c : key) hashValue = 37 * hashValue + c; > > return hashValue % s; > > } > > > >> clear =: {{ 1 [ table =: a: #~ ts } > > I should definitely set occupancy to 0 here regardless, but I am not sure > > about setting the table size to 0 (reusing the initial table size to > > possibly free up memory seems better). The goal of this method is to > > simply empty all of the items inserted into the table, and my current, > > naive understanding is that you would change how many buckets you have > > when it's time to rehash, which is the situation Matt is describing in > his > > talk with the set max load factor method. Could you elaborate on this? > > > > Best Regards, > > Luke De La Cruz > > > > > > On Tue, Jul 18, 2023 at 5:13 AM Elijah Stone <elro...@elronnd.net> > wrote: > > > >> Cool. > >> > >> > if. contains y do. 0 return. end. > >> > >> I think the more usual hash table interface, when asked to insert a key > >> that > >> matches an existing one, replaces the existing value with the new one. > >> > >> > horners =. (2 ^ 32x) | [: (+37x&*)/ 0 |.@, 3 u: ] > >> > >> - Reducing this mod 2^32 at end seems a bit pointless, as you are about > to > >> reduce it mod the table size anyway. Evaluating the intermediate > >> operations > >> mod 2^32, to avoid blowup in the intermediate terms, seems more useful. > >> The > >> way to do this now is to use the recently-added conjunction m. > >> (https://code.jsoftware.com/wiki/Vocabulary/mdot); + m. (2x^32) does an > >> addition mod 2^32, and ditto * . > >> > >> - Adding a zero at the beginning and then reversing seems a bit odd--why > >> not > >> just add the zero to the end and not reverse? The result will be > >> different, > >> but if the hash is good, it should not change the quality. (I guess the > >> only > >> reason for tacking on the zero is to handle zero-length strings? It's > >> annoying that this needs extra effort, and I've specced, but not > >> implemented, > >> a solution. One alternative is to use the recently-added fold family of > >> the > >> conjunctions--https://code.jsoftware.com/wiki/Vocabulary/fcap) > >> > >> Following notes are for recreational purposes only: > >> > >> - I think a more usual approach nowadays to hashing large strings is to > >> apply > >> a mac to some fixed-size blocks (a straightforward--though probably > >> suboptimal--implementation is (-blocksize) mac\ string) and then a more > >> expensive/expressive polynomial or similar for combining the > blocks--this > >> should be faster and better quality than djb, though probably more > effort. > >> > >> - You can use _3&ic--after padding to a multiple of 8 bytes--to get the > >> characters packed more densely--eight per element, rather than one. If > >> using > >> a better hash (i.e., something other than the likes of djb or fna-1a), > >> that > >> will let you get through it quicker. > >> > >> > NB. maybe there's a better int hash? > >> > >> Perhaps an appelby-style mixer--e.g. > >> http://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html > >> > >> It's a good idea to apply such a mixer to the result of the string hash, > >> too. > >> > >> > ts&|`horners@.('literal' -: datatype) y > >> > index =: {{ ts | hash y }} > >> > >> When y is an integer, you reduce it mod x in hash, and then immediately > >> again > >> in index. > >> > >> > clear =: {{ 1 [ table =: a: #~ ts }} > >> > >> I think this is prone to space leaks. Shouldn't it reset the table size > >> to > >> zero? You won't be able to reuse the storage from the table anyway. > >> (Well, > >> you could, but you don't.) Alternately, you could save the size passed > in > >> when the table was created and use that instead. But empirically, it > >> seems > >> like a bad idea to let users control such aspects of the hash table--as > >> the > >> designer of google's hash table says: > https://youtu.be/ncHmEUmJZf4?t=3075. > >> > >> > cs =: 0 NB. current size > >> > >> I think this is usually referred to as 'occupancy' (as in 'how many > >> occupants > >> does the table have?'). > >> > >> > 1 [ insert every items > >> > >> It's common to return EMPTY (which is the same as i.0 0) from a verb > which > >> is > >> called only for effect (rather than for its result). As a matrix with > >> zero > >> rows, it takes up zero lines of precious screenspace at the repl--a pun, > >> but a > >> cute one. > >> > >> On Tue, 18 Jul 2023, Luke D wrote: > >> > >> > I have written an implementation for a separate chaining hash table > that > >> > handles integer and string keys, which I have posted as a gist > >> > https://gist.github.com/4hg/f412476edba86599f1b0e6dd2ea8b12d, and I > am > >> > moderately pleased with the outcome. It was my first exploration with > the > >> > topics found in chapter 25 of Learning J. > >> > > >> > If anyone cares to comment, I am eager to hear any feedback regarding > >> > convention or design. Feel free to tear it up, if necessary, however, > I > >> > still have some testing to do, so apologies if I am showcasing a limp > >> horse. > >> > > >> > Best Regards, > >> > Luke De La Cruz > >> > ---------------------------------------------------------------------- > >> > For information about J forums see > http://www.jsoftware.com/forums.htm > >> ---------------------------------------------------------------------- > >> For information about J forums see http://www.jsoftware.com/forums.htm > >> > > ---------------------------------------------------------------------- > > For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm