Stefan,
I am reluctant to use Nutch in a way it was not intended to be used.
What I will do is change the code so that it only uses the URL to calculate
the hash, run 2 complete generate/fetch/update cycles on a limited number of
domains, and compare the results with those of a cycle that also uses the
content. If using only the URL leads to unacceptable "clutter", I will move
back to the "proper" usage of Nutch.
I checked the code; is it true that I only have to change the Fetcher
(org.apache.nutch.fetcher.Fetcher), specifically lines 236-240, where it
says
if (content == null) {
  content = new Content(url, url, new byte[0], "", new Properties());
  hash = MD5Hash.digest(url);
} else {
  hash = MD5Hash.digest(content.getContent());
}
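If that is the right spot, I assume the change would simply collapse the
two branches so that the hash is always taken from the URL, something like
this (an untested sketch, using the same variables as the snippet above):

if (content == null) {
  // keep creating the empty Content placeholder as before
  content = new Content(url, url, new byte[0], "", new Properties());
}
// always derive the hash from the URL, so pages with identical content
// but different URLs no longer collapse into a single MD5Hash
hash = MD5Hash.digest(url);

That would give every distinct URL in the WebDB its own MD5Hash, at the
price of the duplicate-content filtering you describe below.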
Thanks for your help.
Rgrds, Thomas
On 1/7/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
>
> On 07.01.2006 at 22:14, Thomas Delnoij wrote:
>
> > I am working with Nutch 0.7.1.
> >
> > As far as I understand the current implementation (please correct me
> > if I am wrong), the MD5Hash is calculated based on the Pages' content.
> > Pages with the same content but identified by different URLs share the
> > same MD5Hash.
> Right. If there is no content, the hash is calculated from the URL for
> the moment.
> >
> > My requirement is to be able to uniquely identify all Pages in the
> > WebDB. Pages with the same content, but identified by different URLs,
> > should each get a unique MD5Hash. My question is whether this is
> > feasible at all and, if yes, how it can be accomplished.
>
> For Nutch it makes no sense to calculate the hash based on the URL
> only. Calculating the hash from the content already filters out a lot
> of search engine spam, and in general people are interested in finding
> a page once, not under all URLs where it may be available (e.g. dynamic
> URLs -> same content).
> Anyway, to realize your need you just have to hack Nutch so that it
> only uses the URL as the source for the hash calculation. That
> shouldn't take more than editing a few lines of code.
>
> HTH
> Stefan
>