On Tue, 27 May 2014 16:13:46 +0100, Adam Funk wrote:

> On 2014-05-23, Chris Angelico wrote:
>> On Fri, May 23, 2014 at 8:27 PM, Adam Funk <a24...@ducksburg.com>
>> wrote:
>>> I've also used hashes of strings for other things involving
>>> deduplication or fast lookups (because integer equality is faster than
>>> string equality).  I guess if it's just for deduplication, though, a
>>> set of byte arrays is as good as a set of int?
>> Want a better way to do that? Use a set or dict and let Python do the
>> hashing. It'll be every bit as fast as explicit hashing, plus you get
>> the bonus of simplicity.
> Well, here's the way it works in my mind:
>    I can store a set of a zillion strings (or a dict with a zillion
>    string keys), but every time I test "if new_string in seen_strings",
>    the computer hashes the new_string using some kind of "short hash",
>    checks the set for matching buckets (I'm assuming this is how python
>    tests set membership --- is that right?), 

So far so good. That applies to all objects, not just strings.
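
The description you quoted can be sketched as a toy hash table. This is conceptual only: CPython's sets actually use open addressing rather than per-bucket lists, but the two-step logic (hash narrows the search, equality confirms) is the same:

```python
# Toy hash-bucket membership test, illustrating what set/dict do
# conceptually. Real CPython sets use open addressing, not chaining.
NBUCKETS = 8
buckets = [[] for _ in range(NBUCKETS)]

def toy_add(item):
    buckets[hash(item) % NBUCKETS].append(item)

def toy_contains(item):
    # The hash narrows the search to one bucket; within the bucket,
    # full equality tests confirm (or reject) a match.
    return any(existing == item for existing in buckets[hash(item) % NBUCKETS])

toy_add("hello")
assert toy_contains("hello")
assert not toy_contains("world")
```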

>    then checks any
>    hash-matches for string equality.  Testing string equality is slower
>    than integer equality, and strings (unless they are really short)
>    take up a lot more memory than long integers.

But presumably you have to keep the string around anyway. It's going to 
be somewhere, you can't just throw the string away and garbage collect 
it. The dict doesn't store a copy of the string, it stores a reference to 
it, and extra references don't cost much.

As for string equality, in pseudo-code it looks something like this:

    def __eq__(self, other):
        # Consider only the case where other is a string as well.
        if self is other:  # object identity
            return True
        if len(self) != len(other):  # checking the length is fast
            return False
        # iterate over both strings in lock-step
        for a, b in zip(self, other):
            if a != b:
                return False
        return True

only written in fast C code. So the equality test bails out as quickly as 
possible, and the only time you have to look at every character is if the 
two strings are equal but not the same object.
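
You can demonstrate those early exits directly (the identity and length shortcuts really are taken by CPython's string comparison, though the exact order of checks is an implementation detail):

```python
# Each of these comparisons answers without scanning 10 million characters:
a = "x" * 10_000_000
b = a            # same object: identity check answers immediately
c = "y" + a[1:]  # same length, but differs at the very first character
d = a[:-1]       # different length: the length check answers immediately

assert a == b    # fast: object identity
assert a != d    # fast: lengths differ
assert a != c    # fast: mismatch found at position 0
```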

MD5 hashing, on the other hand, *always* has to look at every character, 
and perform quite a complicated set of computations. It's expensive.

> So for that kind of thing, I tend to store MD5 hashes or something
> similar.  Is that wrong?

First rule of optimization: Don't do it.
Second rule of optimization (for experts only): Don't do it yet.

You're making the cardinal mistake of optimization: optimizing not what 
*is* slow, but what you *assume* will be slow. (Or substitute memory 
consumption for speed, if you prefer.) Python is not C, and your 
intuitions for what's fast and slow (low and high memory consumption) are 
likely to be way off. Consider 
trying to store an MD5 hash instead of the string "Hello World!". If I 
try this in Python 2.7, I get this:

py> import md5, sys
py> s = "Hello World!"
py> n = int(md5.md5(s).hexdigest(), 16)
py> sys.getsizeof(s)
py> sys.getsizeof(n)

I save a lousy 3 bytes. But in order to save that 3 bytes, I have to 
generate an md5 object (32 bytes) and a hex string (53 bytes), both of 
which take time to create and then time to garbage collect.

Actually, I don't even save those 3 bytes, since for nearly all 
reasonable applications, I'll need to keep the string around somewhere. 
So instead of saving three bytes, it's costing me thirty bytes for the 
hash, plus who-knows-what needed to manage the bookkeeping to link that 
hash to the actual string.
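
For what it's worth, the same experiment in Python 3 (where the old `md5` module is gone and strings must be encoded to bytes first) looks like this. The exact byte counts vary by version and platform, so treat them as illustrative:

```python
import hashlib
import sys

s = "Hello World!"
# hashlib replaces the Python 2 md5 module; str must be encoded first.
n = int(hashlib.md5(s.encode("ascii")).hexdigest(), 16)

# Exact values differ across versions/platforms, hence no hard-coded numbers:
print(sys.getsizeof(s), sys.getsizeof(n))
```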

Ah, but maybe we can save time hashing them?

py> from timeit import Timer
py> setup = "from __main__ import s, n"
py> t1 = Timer("hash(s)", setup)
py> t2 = Timer("hash(n)", setup)
py> min(t1.repeat(5))
py> min(t2.repeat(5))

Times are in seconds for one million iterations of the code being timed. 
Or alternatively, it costs about 0.14 microseconds to hash the string, 
and about 0.15 microseconds to hash the MD5 hash. **Not** to calculate 
the MD5 hash in the first place, that takes *much* longer:

py> t3 = Timer("md5.md5(s).hexdigest()", "import md5; s = 'Hello World!'")
py> min(t3.repeat(5))

but merely to perform a lightweight hash on the long int so as to find 
which bucket of the dict it belongs to.

So, yes, chances are **very** good that your supposed optimizations in 
this area are actually pessimizations. If you have not measured where the 
bottlenecks in your code are, you have no idea which parts of the code 
need to be optimized.
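
And measuring is cheap. A sketch with the timeit module, on a made-up workload (the strings here are placeholders for whatever your real data is):

```python
from timeit import timeit

# Hypothetical workload: membership tests against a set of strings.
strings = ["record %d from some feed" % i for i in range(100_000)]
seen = set(strings)
probe = strings[-1]

# Seconds for one million membership tests, i.e. microseconds per test:
t = timeit("probe in seen", globals={"probe": probe, "seen": seen},
           number=1_000_000)
print("%.3f us per membership test" % t)
```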

Now, it is conceivable that there may be some circumstances where your 
strategy makes sense. I'm guessing you would need something like this 
before you even come close to saving memory and/or time:

- rather than short keys like "Hello World!", you are dealing with 
  keys that are typically many tens of thousands of characters long;

- most of the strings are the same length and have very similar 
  prefixes (e.g. the first 3000 chars are the same);

- rather than "zillions" of them, there are few enough of them that
  the chance of an MD5 collision is insignificant;

  (Any MD5 collision is going to play havoc with your strategy of
  using hashes as a proxy for the real string.)

- and you can arrange matters so that you never need to MD5 hash a 
  string twice.
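
If all of those conditions held, the deduplication might look like this sketch (hashlib used so it works on both Python 2 and 3; the names are illustrative, and it only pays off under the conditions just listed):

```python
import hashlib

def dedup(texts):
    """Yield each text the first time it is seen, remembering only a
    16-byte MD5 digest per text rather than the text itself.

    Only worthwhile when the texts are huge and few enough that the
    chance of an MD5 collision is negligible.
    """
    seen = set()
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).digest()  # 16 raw bytes
        if digest not in seen:
            seen.add(digest)
            yield text

assert list(dedup(["spam", "ham", "spam"])) == ["spam", "ham"]
```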

Is my guess closer to the actuality than yours? I don't know. I haven't 
measured it either. But I know that Python is a high-level language with 
lots of high-level data structures like dicts which trade-off time and 
memory for programmer convenience, and that I'd want to see some real 
benchmarks proving that my application was too slow before giving up that 
convenience with a complicated strategy like this.

Steven D'Aprano
