On Wed, Aug 19, 2009 at 1:00 PM, Ken Krugler <[email protected]> wrote:
>> Another question: is Nutch smart enough to use that signature to
>> determine that, say, http://xcski.com/ and http://xcski.com/index.html
>> are the same page?
>
> I believe the hashes would be the same for either raw MD5 or text signature,
> yes. So on the search side these would get collapsed. Don't know about what
> else you mean as far as "same page" - e.g. one entry in the CrawlDB? If so,
> then somebody else with more up-to-date knowledge of Nutch would need to
> chime in here. Older versions of Nutch would still have these as separate
> entries, FWIR.
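
For anyone following along, here is a rough sketch of the raw-MD5 comparison being discussed. It is not Nutch code (Nutch is Java and has its own signature classes); the function name and the page bodies below are made up for illustration. It only shows that byte-identical content gives identical digests, while any byte-level difference gives a different one.

```python
import hashlib


def raw_md5_signature(content: bytes) -> str:
    """Hex MD5 digest of the raw page bytes (hypothetical helper, not a Nutch API)."""
    return hashlib.md5(content).hexdigest()


# Made-up response bodies standing in for two URLs that serve the same page.
root_body = b"<html><body>Hello</body></html>"
index_body = b"<html><body>Hello</body></html>"

# Same page, but with a small byte-level difference between the two fetches
# (assumed here purely for illustration).
index_body_changed = b"<html><body>Hello</body><!-- generated later --></html>"

# Identical bytes hash to the same signature, so the two URLs would match.
same = raw_md5_signature(root_body) == raw_md5_signature(index_body)

# Any difference in the bytes changes the signature, so the URLs would not match.
differs = raw_md5_signature(root_body) != raw_md5_signature(index_body_changed)

print(same, differs)
```

So a signature comparison can only tie two URLs together when the fetched content matches exactly (or, for a text-based signature, when the extracted text profile matches).
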
Actually, I just checked some of my own pages, and http://xcski.com/ and
http://xcski.com/index.html have different signatures, even though they are
the same page. So I guess the answer is no: even if there were logic to treat
them as the same page in CrawlDB, it wouldn't work.

-- 
http://www.linkedin.com/in/paultomblin
