I think so, but well, you have to prove it anyway. :-)
On 18.01.2006, at 13:21, Thomas Delnoij wrote:
Maybe one of the other developers can answer my question as well?
I want to know if I only have to change the Fetcher
(org.apache.nutch.fetcher.Fetcher), lines 236-240, to get a unique
MD5Hash for each Page based on its URL.
Thanks in advance,
Thomas D.
On 1/15/06, Thomas Delnoij <[EMAIL PROTECTED]> wrote:
Stefan,
I am reluctant to use Nutch in a way that it was not intended to
be used.
What I will do is change the code so that it only uses the URL to
calculate the hash, run two complete generate/fetch/update cycles on a
limited number of domains, and compare the results with a cycle that
also uses the content. If using only the URL leads to unacceptable
"clutter", I will move back to the "proper" usage of Nutch.
I checked the code; is it true that I only have to change the Fetcher
(org.apache.nutch.fetcher.Fetcher), lines 236-240, where it says
if (content == null) {
  content = new Content(url, url, new byte[0], "", new Properties());
  hash = MD5Hash.digest(url);
} else {
  hash = MD5Hash.digest(content.getContent());
}
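The change I have in mind would be roughly the following (an untested
sketch against the 0.7.1 sources; the comment and the reordering are
mine, and the line numbers may differ in other checkouts):

if (content == null) {
  content = new Content(url, url, new byte[0], "", new Properties());
}
// Always derive the hash from the URL, even when content was fetched,
// so every Page gets a hash that is unique per URL.
hash = MD5Hash.digest(url);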
Thanks for your help.
Rgrds, Thomas
On 1/7/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
On 07.01.2006, at 22:14, Thomas Delnoij wrote:
I am working with Nutch 0.7.1.
As far as I understand the current implementation (please correct me
if I am wrong), the MD5Hash is calculated based on the Page's content.
Pages with the same content but identified by different URLs share the
same MD5Hash.
Right. If there is no content, the hash is calculated from the URL for
the moment.
My requirement is to be able to uniquely identify all Pages in the
WebDB. Pages with the same content, but identified by different URLs,
should each get a unique MD5Hash. My question is whether this is
feasible at all and, if yes, how it can be accomplished.
For Nutch it makes no sense to calculate the hash based on the URL
only. Calculating the hash from the content already filters out a lot
of search engine spam, and in general people are interested in finding
a page once, not under all the URLs it may be available at (e.g.
dynamic URLs -> same content).
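For illustration (a standalone sketch, not taken from the Nutch
sources; it assumes 0.7.1's org.apache.nutch.io.MD5Hash and made-up
example.com URLs): identical content fetched under two different URLs
gives the same hash, so the pages collapse into one entry, while
hashing the URLs would keep both.

import org.apache.nutch.io.MD5Hash;

public class HashDemo {
  public static void main(String[] args) {
    byte[] sameContent = "<html>same page</html>".getBytes();

    // Content-based hashing: both URLs map to the same hash -> deduplicated.
    MD5Hash byContentA = MD5Hash.digest(sameContent);
    MD5Hash byContentB = MD5Hash.digest(sameContent);
    System.out.println(byContentA.equals(byContentB)); // true

    // URL-based hashing: each URL gets its own hash -> both pages are kept.
    MD5Hash byUrlA = MD5Hash.digest("http://example.com/page?id=1");
    MD5Hash byUrlB = MD5Hash.digest("http://example.com/page?id=2");
    System.out.println(byUrlA.equals(byUrlB)); // false
  }
}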
Anyway, to realize your need you just have to hack Nutch so that it
only uses the URL as the source for the hash calculation. That
shouldn't be more than editing a few lines of code.
HTH
Stefan
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net