On 07.01.2006 at 22:14, Thomas Delnoij wrote:
I am working with Nutch 0.7.1.
As far as I understand the current implementation (please correct me
if I am wrong), the MD5Hash is calculated from the page's content.
Pages with the same content but identified by different URLs share the
same MD5Hash.
Right. If there is no content, the hash is calculated from the URL for
the moment.
My requirement is to be able to uniquely identify all Pages in WebDB.
Pages with the same content, but identified by different URLs, should
each get a unique MD5Hash. My question is whether this is feasible at
all and, if so, how it can be accomplished.
For Nutch it makes no sense to calculate the hash from the URL only.
Calculating the hash from the content already filters out a lot of
search engine spam, and in general people are interested in finding a
page once, not under every URL it may be available at (e.g. dynamic
URLs -> same content).
Anyway, to meet your need you just have to hack Nutch so that it uses
only the URL as the source for the hash calculation. That shouldn't
take more than editing a few lines of code.
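
To illustrate the difference, here is a minimal standalone sketch. It is
not Nutch's actual code: it uses the JDK's MessageDigest rather than
Nutch's MD5Hash class, the class and method names (PageHashSketch,
contentBasedId, urlBasedId) are made up for this example, and treating
an empty body as "no content" is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; none of these names exist in Nutch.
public class PageHashSketch {

    // Hex-encoded MD5 digest of the given bytes.
    public static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Behaviour described above: hash the content, and fall back to
    // the URL when there is no content (assumed: null or empty body).
    public static String contentBasedId(String url, byte[] content) {
        if (content == null || content.length == 0) {
            return md5Hex(url.getBytes(StandardCharsets.UTF_8));
        }
        return md5Hex(content);
    }

    // Requested behaviour: ignore the content and hash only the URL.
    public static String urlBasedId(String url, byte[] content) {
        return md5Hex(url.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] body = "same page body".getBytes(StandardCharsets.UTF_8);
        String a = "http://example.com/page?id=1";
        String b = "http://example.com/page?id=2";

        // Same content under two URLs: the content-based ids collide.
        System.out.println(contentBasedId(a, body).equals(contentBasedId(b, body)));
        // The URL-based ids differ, one per URL.
        System.out.println(urlBasedId(a, body).equals(urlBasedId(b, body)));
    }
}
```

With URL-only hashing every URL gets its own id, but the duplicate
filtering described above is lost.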
HTH
Stefan
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general