Hi Paul,

On Aug 19, 2009, at 6:08am, Paul Tomblin wrote:
Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my page has changed since the last time I crawled it? I patched Nutch to properly handle modification dates, and then discovered that my web site doesn't send Modification-Date because it uses shtml (Server-parsed HTML).
Yes, that's why nobody uses the modification date in the response headers - even when it's there, it often lies.
I assume it's some sort of cryptographic hash of the entire page?
There are two ways for Nutch to calculate the page signature. One is an MD5 hash of the page contents (MD5Signature). The other is a "text signature" (TextProfileSignature) that tries to tolerate minor changes to a web page. Which one to use depends on your situation; the choice is made with the db.signature.class property.
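To make the difference concrete, here is a rough sketch in Python. It is not Nutch's actual code: the function names, the tokenizing regex, and the quantization rule are all made up for illustration. It only shows the general idea of an exact hash versus a hash over a coarse word-frequency profile.

```python
import hashlib
import re
from collections import Counter


def md5_signature(content: bytes) -> str:
    """Exact signature: any byte-level change yields a different digest."""
    return hashlib.md5(content).hexdigest()


def text_signature(text: str, min_token_len: int = 2, quant: int = 2) -> str:
    """Lenient signature (illustrative only): hash a coarse word-frequency profile."""
    # Lowercase and split into alphanumeric tokens, skipping very short ones.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    # Round each count down to a multiple of `quant`; tokens that round
    # to zero (rare words, timestamps, counters) drop out of the profile.
    profile = {tok: (n // quant) * quant for tok, n in counts.items()}
    profile = {tok: n for tok, n in profile.items() if n > 0}
    # Sort so the result does not depend on word order in the page.
    ordered = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    canonical = "\n".join(f"{n} {tok}" for tok, n in ordered)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

With something like this, two fetches of a page that differ only in an "Updated 10:01" versus "Updated 10:02" footer get different MD5 signatures but the same text signature, which is the kind of leniency I mean.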
Another question: is Nutch smart enough to use that signature to determine that, say, http://xcski.com/ and http://xcski.com/index.html are the same page?
I believe the hashes would be the same for either raw MD5 or text signature, yes, assuming both URLs return identical content. So on the search side these would get collapsed. I'm not sure what else you mean by "same page" - e.g. a single entry in the CrawlDB? If so, then somebody else with more up-to-date knowledge of Nutch would need to chime in here. Older versions of Nutch would still keep these as separate entries, FWIR.
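As a sketch of what collapsing by signature amounts to (again in Python and not Nutch's own dedup code; group_by_digest is a hypothetical helper), URLs are simply grouped by the digest of their content:

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def group_by_digest(pages: Dict[str, bytes]) -> List[List[str]]:
    """Group URLs whose fetched content has the same MD5 digest."""
    groups = defaultdict(list)
    for url, content in pages.items():
        groups[hashlib.md5(content).hexdigest()].append(url)
    # Each inner list is one set of duplicates, sorted for stable output.
    return [sorted(urls) for urls in groups.values()]
```

If http://xcski.com/ and http://xcski.com/index.html return byte-identical content, they land in the same group. Both URLs still exist as separate keys in the input, which mirrors the distinction between collapsing at search time and having one entry in the CrawlDB.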
-- Ken
