Paul Tomblin wrote:
On Wed, Aug 19, 2009 at 1:00 PM, Ken Krugler<kkrugler_li...@transpac.com> wrote:
Another question: is Nutch smart enough to use that signature to
determine that, say, http://xcski.com/ and http://xcski.com/index.html
are the same page?
I believe the hashes would be the same for either raw MD5 or text signature,
yes. So on the search side these would get collapsed. Don't know about what
else you mean as far as "same page" - e.g. one entry in the CrawlDB? If so,
then somebody else with more up-to-date knowledge of Nutch would need to
chime in here. Older versions of Nutch would still have these as separate
entries, FWIR.

Actually, I just checked some of my own pages, and http://xcski.com/
and http://xcski.com/index.html have different signatures, in spite of
them being the same page.  So I guess the answer to that is no, even
if there were logic to make them the same page in CrawlDB, it wouldn't
work.

There is nothing magic about the process of calculating a signature. For example, MD5Signature just takes Content.getContent() (an array of bytes) and runs it through MD5. So if you get different MD5 values, then your content was indeed different (even if the only difference was an advertisement link somewhere on the page).
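
To illustrate the point about raw-byte hashing, here is a minimal sketch in Python (not Nutch's Java code). The md5_signature helper and the two sample pages are made up for illustration; the only assumption is that the signature is an MD5 digest over the fetched bytes, as described above.

```python
import hashlib


def md5_signature(content: bytes) -> str:
    """Return the hex MD5 digest of the raw page bytes (hypothetical helper)."""
    return hashlib.md5(content).hexdigest()


# Two made-up responses that differ only in a rotating advertisement link.
page_a = b"<html><body>Hello <a href='/ad/1'>ad</a></body></html>"
page_b = b"<html><body>Hello <a href='/ad/2'>ad</a></body></html>"

# Identical bytes always give the same signature ...
same = md5_signature(page_a) == md5_signature(bytes(page_a))
# ... while a one-character difference gives a different one.
different = md5_signature(page_a) != md5_signature(page_b)

print(same, different)
```

So two URLs that serve visually identical pages will still get different signatures if even one byte of the response differs between the two fetches.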

You could use urlnormalizer to collapse www.example.com/ and www.example.com/index.html into a single entry; in fact, there is a commented-out rule like that in the urlnormalizer config file. But as you observed above, there may be cases when these two are not really the same page, so you need to be careful ...
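
As a rough sketch of what such a normalization rule does, again in Python and not the actual rule from the Nutch config file: the pattern and the normalize_url helper below are assumptions made up for illustration.

```python
import re

# Hypothetical pattern: a trailing "/index.html" is rewritten to "/".
_INDEX_RE = re.compile(r"/index\.html$")


def normalize_url(url: str) -> str:
    """Collapse a trailing /index.html into / (illustrative only)."""
    return _INDEX_RE.sub("/", url)


root = normalize_url("http://xcski.com/")
index = normalize_url("http://xcski.com/index.html")
other = normalize_url("http://xcski.com/about.html")

print(root, index, other)
```

After normalization both forms map to the same URL, so they end up as one entry. The rule is applied blindly, though: it collapses the two URLs even on a site where / and /index.html serve different content.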


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
