On Wed, Aug 19, 2009 at 1:00 PM, Ken Krugler<[email protected]> wrote:
>> Another question: is Nutch smart enough to use that signature to
>> determine that, say, http://xcski.com/ and http://xcski.com/index.html
>> are the same page?
>
> I believe the hashes would be the same for either raw MD5 or text signature,
> yes. So on the search side these would get collapsed. Don't know about what
> else you mean as far as "same page" - e.g. one entry in the CrawlDB? If so,
> then somebody else with more up-to-date knowledge of Nutch would need to
> chime in here. Older versions of Nutch would still have these as separate
> entries, FWIR.

Actually, I just checked some of my own pages, and http://xcski.com/
and http://xcski.com/index.html have different signatures, in spite of
them being the same page.  So I guess the answer is no: even if there
were logic to treat them as the same page in CrawlDB, it wouldn't work,
because the signatures don't match.
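
To illustrate how two URLs for the "same" page can end up with different
signatures, here is a minimal Python sketch. It is not Nutch's code: the
helper names raw_signature and text_signature and the sample page bodies
are made up for illustration, and the tag stripping is a crude stand-in
for a real text-based signature. I don't know what actually differs
between my two pages; the comment with a timestamp is just one
hypothetical cause.

```python
import hashlib
import re


def raw_signature(body: bytes) -> str:
    """MD5 over the raw bytes: any single-byte difference changes the hash."""
    return hashlib.md5(body).hexdigest()


def text_signature(body: bytes) -> str:
    """MD5 over a crudely normalized text view of the page.

    Hypothetical normalization: strip anything that looks like a tag
    (including simple comments), collapse whitespace, and lowercase.
    """
    text = re.sub(rb"<[^>]*>", b" ", body)
    text = b" ".join(text.split()).lower()
    return hashlib.md5(text).hexdigest()


# Hypothetical bodies: identical visible text, but the server embeds a
# different timestamp comment on each request.
page_root = b"<html><body><p>Hello</p><!-- served 13:00:01 --></body></html>"
page_index = b"<html><body><p>Hello</p><!-- served 13:00:02 --></body></html>"

print(raw_signature(page_root) == raw_signature(page_index))
print(text_signature(page_root) == text_signature(page_index))
```

In this sketch the raw hashes differ while the text hashes match, so
which signature is in use matters. If even the text-based signatures
differ, then the visible content itself is not identical between the
two fetches.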


-- 
http://www.linkedin.com/in/paultomblin
