[Nutch-general] Re: MD5Hash

Stefan Groschupf Sat, 07 Jan 2006 13:24:41 -0800


On 07.01.2006 at 22:14, Thomas Delnoij wrote:

I am working with Nutch 0.7.1.
As far as I understand the current implementation (please correct me if I am wrong), the MD5Hash is calculated based on the Page's content. Pages with the same content but identified by different URLs share the same MD5Hash.

Right. If there is no content, the hash is calculated from the URL for the moment.

My requirement is to be able to uniquely identify all Pages in WebDB. Pages with the same content, but identified by different URLs, should each get a unique MD5Hash. My question is whether this is feasible at all and, if yes, how this can be accomplished.

For Nutch it makes no sense to calculate the hash based on the URL only. Calculating the hash from the content already filters out a lot of search engine spam, and in general people are interested in finding a page once and not under every URL it may be available at (e.g. dynamic URLs -> same content).

Anyway, to realize your need you just have to hack Nutch so that it uses only the URL as the source for the hash calculation. That shouldn't be more than editing a few lines of code.
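
To illustrate the difference between the two strategies, here is a minimal standalone sketch. It does not use Nutch's own classes: PageHashSketch, md5Hex, contentHash and urlHash are hypothetical names made up for this example, and it relies only on java.security.MessageDigest from the JDK. The fallback to the URL when there is no content follows the behaviour described above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; this is not the Nutch 0.7.1 implementation.
public class PageHashSketch {

    // Returns the MD5 digest of the given bytes as a lowercase hex string.
    public static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Content-based hash: pages with identical content share one hash.
    // Falls back to the URL when there is no content.
    public static String contentHash(String url, byte[] content) {
        if (content == null || content.length == 0) {
            return urlHash(url);
        }
        return md5Hex(content);
    }

    // URL-based hash: every distinct URL gets its own hash,
    // regardless of what the page contains.
    public static String urlHash(String url) {
        return md5Hex(url.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] body = "same content".getBytes(StandardCharsets.UTF_8);
        String a = "http://example.com/page?id=1";
        String b = "http://example.com/page?id=2";

        // Same content under two URLs: the content hashes are equal.
        System.out.println(contentHash(a, body).equals(contentHash(b, body)));
        // The URL hashes differ, so each page is identified uniquely.
        System.out.println(urlHash(a).equals(urlHash(b)));
    }
}
```

With the URL as the hash source, the two example pages are kept apart even though their content is identical, which is the behaviour asked for; the price is that duplicate content is no longer collapsed.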

HTH

Stefan

