On Wed, 29 Dec 2004 13:13:22 -0800, Palit, Nilanjan <[EMAIL PROTECTED]> wrote:
> Folks,
>
> Thanks for the good ideas & the performance discussion. I'll try out the
> different suggestions.
>
> Now, regarding Tom Metro's original suggestion for using an MD5 Digest:
> I read that the original MD5 algorithm has known issues with collisions.
> Any experiences with how well Digest::MD5 does when used with many
> millions of keys? Do I need to test for collisions myself (at the
> expense of lost performance), or is it pretty well tested (or proved?)
> to stand up to an intensive application?
FYI, "known issues" means that there is a known way to produce two files
with the same MD5 hash that is faster than brute-force search. Under
normal circumstances, to get non-minuscule odds of a collision somewhere
among your MD5 keys, you'd need about 2**64 keys (the birthday bound for
a 128-bit digest). If you have fewer than, say, a billion keys, then you
can ignore that possibility for all practical intents and purposes.

That said, the suggestion of using MD5 keys is a non-starter for
eliminating the performance issue. Calculating an MD5 hash of a string
of length n is O(n). In fact, _any_ decent hashing algorithm is going to
take time proportional to the length of the string: if it tried to take
less time, it would have to skip parts of the string, and then it
couldn't notice changes in the skipped parts.

Cheers,
Ben

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
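[A quick back-of-the-envelope check of the 2**64 figure above, sketched in
Python rather than Perl just for illustration. It uses the standard
birthday-bound approximation P(collision) ~= 1 - exp(-n**2 / 2**(b+1))
for n random b-bit digests; the function name is mine, not from any
library.]

```python
import math

def collision_probability(n, bits=128):
    # Birthday-bound approximation: with n uniformly random b-bit
    # digests, P(at least one collision) ~= 1 - exp(-n^2 / 2^(b+1)).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-n * n / 2.0 ** (bits + 1))

# Even a billion MD5 keys gives an astronomically small collision chance:
print(collision_probability(10**9))   # on the order of 1e-21

# The odds only become non-minuscule around 2**64 keys:
print(collision_probability(2**64))   # roughly 0.39
</```>

So with a billion keys the accidental-collision probability is around
10**-21, and it takes about 2**64 keys before it climbs to ~39%, which
is consistent with the estimate above.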

