[Nutch-general] Re: MD5Hash

Thomas Delnoij Wed, 18 Jan 2006 04:23:21 -0800

Maybe one of the other developers can answer my question as well?

I want to know if I only have to change the Fetcher (
org.apache.nutch.fetcher.Fetcher), lines 236-240, to get a unique
MD5Hash for each Page based on its URL.
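
To make the difference concrete, here is a minimal sketch of the two hashing
strategies. It uses only the JDK's MessageDigest rather than Nutch's own
MD5Hash class, and the names UrlHashSketch, md5Hex, contentHash and urlHash
are hypothetical, not part of Nutch. It is an illustration of the idea, not
the actual Fetcher change.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration class; not part of Nutch.
public class UrlHashSketch {

    // Returns the hex-encoded MD5 digest of the given bytes.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Content-based strategy (mirrors the quoted Fetcher logic):
    // hash the content, and fall back to the URL only when there is none.
    static String contentHash(String url, byte[] content) {
        if (content == null) {
            return md5Hex(url.getBytes(StandardCharsets.UTF_8));
        }
        return md5Hex(content);
    }

    // URL-only strategy: the hash depends on the URL and nothing else.
    static String urlHash(String url) {
        return md5Hex(url.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] body = "<html>same page</html>".getBytes(StandardCharsets.UTF_8);
        String a = "http://example.com/page?id=1";
        String b = "http://example.com/page?id=2";

        // Same content under two URLs: the content hashes collide.
        System.out.println(contentHash(a, body).equals(contentHash(b, body))); // true
        // URL-only hashing keeps the two pages distinct.
        System.out.println(urlHash(a).equals(urlHash(b))); // false
    }
}
```

With the URL-only variant every distinct URL gets its own hash, at the cost of
losing the duplicate detection that content hashing provides.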


Thanks in advance,

Thomas D.

On 1/15/06, Thomas Delnoij <[EMAIL PROTECTED]> wrote:
>
> Stefan,
>
> I am reluctant to use Nutch in a way that it was not intended to be used.
> What I will do is change the code so that it only uses the URL to calculate
> the hash, run 2 complete generate/fetch/update cycles on a limited number of
> domains, and compare the results with a cycle that also uses the content.
> If using only the URL does lead to unacceptable "clutter", I will move back
> to the "proper" usage of Nutch.
>
> I checked the code. Is it true that I only have to change the Fetcher (
> org.apache.nutch.fetcher.Fetcher), specifically lines 236-240, where it says
>
>       if (content == null) {
>         content = new Content(url, url, new byte[0], "", new Properties());
>         hash = MD5Hash.digest(url);
>       } else {
>         hash = MD5Hash.digest(content.getContent());
>       }
>
>
> Thanks for your help.
>
> Rgrds, Thomas
>
> On 1/7/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> >
> >
> > On 07.01.2006 at 22:14, Thomas Delnoij wrote:
> >
> > > I am working with Nutch 0.7.1.
> > >
> > > As far as I understand the current implementation (please correct me if I
> > > am wrong), the MD5Hash is calculated based on the Pages' content. Pages with
> > > the same content but identified by different URLs share the same MD5Hash.
> > Right. If there is no content, the hash is calculated from the URL for the
> > moment.
> > >
> > > My requirement is to be able to uniquely identify all Pages in WebDB. Pages
> > > with the same content, but identified by different URLs, should each get a
> > > unique MD5Hash. My question is whether this is feasible at all and, if so,
> > > how it can be accomplished.
> >
> > For Nutch it makes no sense to calculate the hash based on the URL only.
> > Calculating the hash from the content already filters out a lot of search
> > engine spam, and in general people are interested in finding a page once and
> > not under all the URLs that may be available (e.g. dynamic URLs -> same
> > content).
> > Anyway, to realize your need you just need to hack Nutch so that it will
> > only use the URL as the source for hash calculation. That shouldn't be
> > more than editing a few lines of code.
> >
> > HTH
> > Stefan
> >
>
>
