On 2014-05-23, Chris Angelico wrote:
> On Fri, May 23, 2014 at 8:27 PM, Adam Funk <a24...@ducksburg.com> wrote:
>> I've also used hashes of strings for other things involving
>> deduplication or fast lookups (because integer equality is faster than
>> string equality). I guess if it's just for deduplication, though, a
>> set of byte arrays is as good as a set of ints?
> Want a better way to do that? Use a set or dict and let Python do the
> hashing. It'll be every bit as fast as explicit hashing, plus you get
> the bonus of simplicity.
Well, here's the way it works in my mind:
I can store a set of a zillion strings (or a dict with a zillion
string keys), but every time I test "if new_string in
seen_strings", the computer hashes new_string, uses that hash to
find the matching bucket (I'm assuming this is how Python tests
set membership --- is that right?), and then checks any
hash-matches for string equality. Testing string equality is
slower than integer equality, and strings (unless they are really
short) take up a lot more memory than long integers.
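That lookup order (hash first, equality only on a bucket match) is easy
to observe directly. Here's a small sketch; the Tracked subclass is
purely hypothetical, just a str that counts its own __hash__ and __eq__
calls:

```python
class Tracked(str):
    """A str subclass that counts __hash__ and __eq__ calls."""
    hash_calls = 0
    eq_calls = 0

    def __hash__(self):
        Tracked.hash_calls += 1
        return super().__hash__()

    def __eq__(self, other):
        Tracked.eq_calls += 1
        return super().__eq__(other)

seen = {Tracked("spam"), Tracked("eggs")}

# Reset the counters after construction, then probe with a *different*
# object that compares equal to a stored one.
probe = Tracked("spam")
Tracked.hash_calls = Tracked.eq_calls = 0

found = probe in seen
print(found, Tracked.hash_calls, Tracked.eq_calls)
```

On CPython this shows one hash call and one equality call for the
membership test: the hash narrows the search to a bucket, and equality
is only checked against candidates whose stored hash matches.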
So for that kind of thing, I tend to store MD5 hashes or something
similar. Is that wrong?
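For the record, the digest-based dedup I mean looks roughly like this
(a sketch using the stdlib hashlib; the seen_before helper is my own
invention, not anything standard):

```python
import hashlib

seen_digests = set()

def seen_before(s):
    """Return True if this string's MD5 digest was already recorded."""
    d = hashlib.md5(s.encode("utf-8")).digest()  # 16 bytes, regardless of len(s)
    if d in seen_digests:
        return True
    seen_digests.add(d)
    return False

long_line = "x" * 10_000
print(seen_before(long_line))   # first sighting
print(seen_before(long_line))   # duplicate
```

The appeal is that each stored key is a fixed 16 bytes however long the
input string is --- at the cost of trusting MD5 not to collide.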
With the breakdown of the medieval system, the gods of chaos, lunacy,
and bad taste gained ascendancy.
--- Ignatius J Reilly