I think so, but well, you have to prove it anyway. :-)
On 18.01.2006, at 13:21, Thomas Delnoij wrote:
Maybe one of the other developers can answer my question as well?
I want to know if I only have to change the Fetcher
(org.apache.nutch.fetcher.Fetcher), lines 236-240, to get a unique
MD5Hash for each Page based on its URL.
Thanks in advance,
Thomas D.
On 1/15/06, Thomas Delnoij <[EMAIL PROTECTED]> wrote:
Stefan,
I am reluctant to use Nutch in a way that it was not intended to
be used.
What I will do is change the code so that it only uses the URL to
calculate the hash, run two complete generate/fetch/update cycles on a
limited number of domains, and compare the results with a cycle that
also uses the content. If using only the URL leads to unacceptable
"clutter", I will move back to the "proper" usage of Nutch.
I checked the code; is it true that I only have to change the Fetcher
(org.apache.nutch.fetcher.Fetcher), lines 236-240, where it says
if (content == null) {
  content = new Content(url, url, new byte[0], "", new Properties());
  hash = MD5Hash.digest(url);
} else {
  hash = MD5Hash.digest(content.getContent());
}
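The change I have in mind would be roughly the following (an untested
sketch against the 0.7.1 sources; the comment and the reordering are
mine, and the line numbers may differ in other checkouts):

if (content == null) {
  content = new Content(url, url, new byte[0], "", new Properties());
}
// Always derive the hash from the URL, even when content was fetched,
// so every Page gets a hash that is unique per URL.
hash = MD5Hash.digest(url);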
Thanks for your help.
Rgrds, Thomas
On 1/7/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
On 07.01.2006, at 22:14, Thomas Delnoij wrote:
I am working with Nutch 0.7.1.
As far as I understand the current implementation (please correct me
if I am wrong), the MD5Hash is calculated based on the Page's content.
Pages with the same content but identified by different URLs share the
same MD5Hash.
Right. If there is no content, the hash is calculated from the URL for
the moment.
My requirement is to be able to uniquely identify all Pages in the
WebDB. Pages with the same content, but identified by different URLs,
should each get a unique MD5Hash. My question is whether this is
feasible at all and, if yes, how it can be accomplished.
For Nutch it makes no sense to calculate the hash based on the URL
only. Calculating the hash from the content already filters out a lot
of search engine spam, and in general people are interested in finding
a page once, not under all the URLs it may be available at (e.g.
dynamic URLs -> same content).
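For illustration (a standalone sketch, not taken from the Nutch
sources; it assumes 0.7.1's org.apache.nutch.io.MD5Hash and made-up
example.com URLs): identical content fetched under two different URLs
gives the same hash, so the pages collapse into one entry, while
hashing the URLs would keep both.

import org.apache.nutch.io.MD5Hash;

public class HashDemo {
  public static void main(String[] args) {
    byte[] sameContent = "<html>same page</html>".getBytes();

    // Content-based hashing: both URLs map to the same hash -> deduplicated.
    MD5Hash byContentA = MD5Hash.digest(sameContent);
    MD5Hash byContentB = MD5Hash.digest(sameContent);
    System.out.println(byContentA.equals(byContentB)); // true

    // URL-based hashing: each URL gets its own hash -> both pages are kept.
    MD5Hash byUrlA = MD5Hash.digest("http://example.com/page?id=1");
    MD5Hash byUrlB = MD5Hash.digest("http://example.com/page?id=2");
    System.out.println(byUrlA.equals(byUrlB)); // false
  }
}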
Anyway, to realize your need you just have to hack Nutch so that it
only uses the URL as the source for the hash calculation. That
shouldn't be more than editing a few lines of code.
HTH
Stefan
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net