On 2006-07-11, Qiangning Hong <[EMAIL PROTECTED]> wrote:

> I'm writing a spider. I have millions of urls in a table (mysql) to
> check if a url has already been fetched. To check fast, I am
> considering adding a "hash" column to the table, making it a unique
> key, and using the following sql statement:
>
>   insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
>
> to add a new url.
>
> I believe this will be faster than making the "url" column a unique
> key and doing string comparison. Right?

I doubt it will be significantly faster. Comparing two strings and
hashing a string are both O(N).

> However, when I come to Python's builtin hash() function, I
> found it produces different values on my two computers! On a
> Pentium 4, hash('a') -> -468864544; on an amd64, hash('a') ->
> 12416037344. Does the hash function depend on the machine's word
> length?

Apparently. :)

The low 32 bits match, so perhaps you should just use that portion
of the returned hash?

>>> hex(12416037344)
'0x2E40DB1E0L'
>>> hex(-468864544 & 0xffffffffffffffff)
'0xFFFFFFFFE40DB1E0L'
>>> hex(12416037344 & 0xffffffff)
'0xE40DB1E0L'
>>> hex(-468864544 & 0xffffffff)
'0xE40DB1E0L'

-- 
Grant Edwards
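
For anyone who wants to try the masking suggestion, here is a minimal sketch. The helper names low32 and stable_hash32 are made up for illustration and are not from the thread. The first helper just keeps the low 32 bits of an integer, as in the hex() session above. The second swaps in zlib.crc32 rather than the builtin hash(), on the assumption that you want a value that is the same on every machine and every run. Since Python 3.3, hash() on strings is randomized per process by default, so it is not suitable for storing in a database.

```python
import zlib

MASK32 = 0xFFFFFFFF

def low32(value):
    """Keep only the low 32 bits of an integer, as an unsigned value."""
    return value & MASK32

def stable_hash32(url):
    """Hypothetical helper: CRC-32 of the UTF-8 encoded URL.

    This is a swapped-in technique, not the builtin hash(); CRC-32
    does not depend on word length or on the interpreter process.
    """
    return zlib.crc32(url.encode("utf-8")) & MASK32

# The two hash('a') values reported in the thread.
pentium4_value = -468864544
amd64_value = 12416037344

print(hex(low32(pentium4_value)))  # 0xe40db1e0
print(hex(low32(amd64_value)))     # 0xe40db1e0
print(stable_hash32("http://www.python.org/"))
```

One caveat on the design: with millions of urls, distinct urls sharing the same 32-bit value are to be expected, so a unique key on the hash column combined with "insert ignore" would probably drop some new urls silently. A wider digest, or a non-unique index on the hash plus a comparison on the url itself, may be safer.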