On 07.01.2006 at 22:14, Thomas Delnoij wrote:

I am working with Nutch 0.7.1.

As far as I understand the current implementation (please correct me if I am wrong), the MD5Hash is calculated from the page's content. Pages with the same content but different URLs share the same MD5Hash.
Right. If there is no content, the hash is calculated from the URL for the moment.

My requirement is to be able to uniquely identify all Pages in WebDB. Pages with the same content, but identified by different URLs, should each get a unique MD5Hash. My question is whether this is feasible at all and, if so, how it can be accomplished.

For Nutch it makes no sense to calculate the hash based on the URL only. Calculating the hash from the content already filters out a lot of search engine spam, and in general people want to find a page once, not under every URL it may be available at (e.g. dynamic URLs that lead to the same content). Anyway, to get what you need, you just have to hack Nutch so that it uses only the URL as the source for the hash calculation. That shouldn't take more than editing a few lines of code.
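
To make the difference between the two strategies concrete, here is a minimal sketch using only the JDK's java.security.MessageDigest. It is not Nutch code: the class PageHashSketch and the methods md5Hex, hashFromContent and hashFromUrl are hypothetical names made up for illustration, and the fallback for empty content only mirrors the behaviour described above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; none of these names come from Nutch.
final class PageHashSketch {

    // Hex-encoded MD5 digest of the given bytes.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder(32);
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Content-based strategy: identical content gives an identical hash,
    // whatever the URL. Falls back to the URL when there is no content.
    static String hashFromContent(String url, byte[] content) {
        if (content == null || content.length == 0) {
            return hashFromUrl(url);
        }
        return md5Hex(content);
    }

    // URL-based strategy: every distinct URL string gets its own hash,
    // even when the fetched content is byte-for-byte the same.
    static String hashFromUrl(String url) {
        return md5Hex(url.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] body = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);
        String urlA = "http://example.com/page?id=1";
        String urlB = "http://example.com/page?id=2";

        // Same content under two URLs: content hashes collide, URL hashes do not.
        System.out.println(hashFromContent(urlA, body).equals(hashFromContent(urlB, body)));
        System.out.println(hashFromUrl(urlA).equals(hashFromUrl(urlB)));
    }
}
```

With the URL-based variant, two URLs serving the same content are kept as separate entries, so you give up the duplicate filtering mentioned above. Note also that the URL is hashed exactly as written, so any normalization would have to happen before hashing.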

HTH
Stefan

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
