On 2006-07-12, Carl Banks <[EMAIL PROTECTED]> wrote: > Grant Edwards wrote: >> On 2006-07-11, Qiangning Hong <[EMAIL PROTECTED]> wrote: >> >> > I'm writing a spider. I have millions of urls in a table (mysql) to >> > check if a url has already been fetched. To check fast, I am >> > considering to add a "hash" column in the table, make it a unique key, >> > and use the following sql statement: >> > insert ignore into urls (url, hash) values (newurl, hash_of_newurl) >> > to add new url. >> > >> > I believe this will be faster than making the "url" column unique key >> > and doing string comparation. Right? >> >> I doubt it will be significantly faster. Comparing two strings >> and hashing a string are both O(N). > > Playing Devil's Advocate: The hash would be a one-time operation during > database insertion, whereas string comparison would happen every > search.
Good point. > Conceivably, comparing hash strings (which is O(1)) could > result in a big savings compared to comparing regular strings; Still, I doubt that the URLs are long enough so that there's a significant difference. > but I expect most decent sql implementations already hash data > internally, so rolling your own hash would be useless at best. Precisely. DB designers and implementers have been working on this problem for 30 years. I doubt the OP is going to be able to best them with a few minutes work. > If the OP's database is lacking, md5 is probably fine. Perhaps > using a subset of the md5 (the low 32 bits, say) could speed > up comparisons at risk of more collisions. Probably a good > trade off unless the DB is humungous. My advice: do it the simple way first (let the DB handle it). Don't try to fix a problem until you know it exists. Premature optimization.... -- Grant Edwards grante Yow! It's strange, but I'm at only TRULY ALIVE when I'm visi.com covered in POLKA DOTS and TACO SAUCE... -- http://mail.python.org/mailman/listinfo/python-list