Wow, I appreciate your thorough examination.
Nearly all of the shortcomings came from me following the simple interface
and behavior laid out in chapter 5 of Mark Weiss' "Data Structures and
Algorithm Analysis in C++" (I wanted to ensure that I had something working
as intended), such as using horners to begin with and reducing mod 2^32
(the book uses unsigned ints, though you are right about moving the mod
inside the reduction) and reversing the input string to mimic a for each
loop.
I guess the only reason for tacking on the zero is to handle zero-length
strings?
Frankly, I hadn't even considered empty strings, but rather, I was using 0
as a logical start value.
One alternative is to use the recently-added fold family of the
conjunctions
I have actually never used these before, but it appears that the equivalent
of the C++ code, found in Figure 5.4 of the aforementioned book and below,
would be (0 [F..((2 ^ 32x)|[+37x*]) 3 u: ]), which is much less
intimidating than I had originally thought and handles empty strings as you
mentioned.
unsigned int horners(const std::string& key, int tableSize) {
unsigned int hashValue = 0;
for (char c : key) hashValue = 37 * hashValue + c;
return hashValue % s;
}
clear =: {{ 1 [ table =: a: #~ ts }
I should definitely set occupancy to 0 here regardless, but I am not sure
about setting the table size to 0 (reusing the initial table size to
possibly free up memory seems better). The goal of this method is to
simply empty all of the items inserted into the table, and my current,
naive understanding is that you would change how many buckets you have
when it's time to rehash, which is the situation Matt is describing in his
talk with the set max load factor method. Could you elaborate on this?
Best Regards,
Luke De La Cruz
On Tue, Jul 18, 2023 at 5:13 AM Elijah Stone <elro...@elronnd.net> wrote:
Cool.
> if. contains y do. 0 return. end.
I think the more usual hash table interface, when asked to insert a key
that
matches an existing one, replaces the existing value with the new one.
> horners =. (2 ^ 32x) | [: (+37x&*)/ 0 |.@, 3 u: ]
- Reducing this mod 2^32 at end seems a bit pointless, as you are about to
reduce it mod the table size anyway. Evaluating the intermediate
operations
mod 2^32, to avoid blowup in the intermediate terms, seems more useful.
The
way to do this now is to use the recently-added conjunction m.
(https://code.jsoftware.com/wiki/Vocabulary/mdot); + m. (2x^32) does an
addition mod 2^32, and ditto * .
- Adding a zero at the beginning and then reversing seems a bit odd--why
not
just add the zero to the end and not reverse? The result will be
different,
but if the hash is good, it should not change the quality. (I guess the
only
reason for tacking on the zero is to handle zero-length strings? It's
annoying that this needs extra effort, and I've specced, but not
implemented,
a solution. One alternative is to use the recently-added fold family of
the
conjunctions--https://code.jsoftware.com/wiki/Vocabulary/fcap)
Following notes are for recreational purposes only:
- I think a more usual approach nowadays to hashing large strings is to
apply
a mac to some fixed-size blocks (a straightforward--though probably
suboptimal--implementation is (-blocksize) mac\ string) and then a more
expensive/expressive polynomial or similar for combining the blocks--this
should be faster and better quality than djb, though probably more effort.
- You can use _3&ic--after padding to a multiple of 8 bytes--to get the
characters packed more densely--eight per element, rather than one. If
using
a better hash (i.e., something other than the likes of djb or fna-1a),
that
will let you get through it quicker.
> NB. maybe there's a better int hash?
Perhaps an appelby-style mixer--e.g.
http://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html
It's a good idea to apply such a mixer to the result of the string hash,
too.
> ts&|`horners@.('literal' -: datatype) y
> index =: {{ ts | hash y }}
When y is an integer, you reduce it mod x in hash, and then immediately
again
in index.
> clear =: {{ 1 [ table =: a: #~ ts }}
I think this is prone to space leaks. Shouldn't it reset the table size
to
zero? You won't be able to reuse the storage from the table anyway.
(Well,
you could, but you don't.) Alternately, you could save the size passed in
when the table was created and use that instead. But empirically, it
seems
like a bad idea to let users control such aspects of the hash table--as
the
designer of google's hash table says: https://youtu.be/ncHmEUmJZf4?t=3075.
> cs =: 0 NB. current size
I think this is usually referred to as 'occupancy' (as in 'how many
occupants
does the table have?').
> 1 [ insert every items
It's common to return EMPTY (which is the same as i.0 0) from a verb which
is
called only for effect (rather than for its result). As a matrix with
zero
rows, it takes up zero lines of precious screenspace at the repl--a pun,
but a
cute one.
On Tue, 18 Jul 2023, Luke D wrote:
> I have written an implementation for a separate chaining hash table that
> handles integer and string keys, which I have posted as a gist
> https://gist.github.com/4hg/f412476edba86599f1b0e6dd2ea8b12d, and I am
> moderately pleased with the outcome. It was my first exploration with the
> topics found in chapter 25 of Learning J.
>
> If anyone cares to comment, I am eager to hear any feedback regarding
> convention or design. Feel free to tear it up, if necessary, however, I
> still have some testing to do, so apologies if I am showcasing a limp
horse.
>
> Best Regards,
> Luke De La Cruz
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm