Is SIGNATURE_KEY (aka nutch.content.digest) a valid way to check if my
page has changed since the last time I crawled it?  I patched Nutch to
properly handle modification dates, and then discovered that my web
site doesn't send a Last-Modified header because it uses shtml
(server-parsed HTML).  I assume the signature is some sort of
cryptographic hash of the entire page?

Another question: is Nutch smart enough to use that signature to
determine that, say, http://xcski.com/ and http://xcski.com/index.html
are the same page?
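
To make both questions concrete, here is a rough sketch of the comparison I
have in mind. It assumes the signature is a plain MD5 over the raw page
bytes, which is precisely the thing I'm unsure about. DigestSketch, digest
and sameContent are names I made up for illustration, not Nutch APIs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestSketch {
    // Hex-encoded MD5 of the raw page bytes.
    // Hypothetical helper: an assumption about what the signature might be.
    static String digest(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Two fetches count as "the same page" when their digests match.
    static boolean sameContent(byte[] a, byte[] b) {
        return digest(a).equals(digest(b));
    }

    public static void main(String[] args) {
        // Made-up page bodies standing in for fetched content.
        byte[] lastCrawl = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        byte[] thisCrawl = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        byte[] edited = "<html>hello again</html>".getBytes(StandardCharsets.UTF_8);

        System.out.println("unchanged page matches: " + sameContent(lastCrawl, thisCrawl));
        System.out.println("edited page matches:    " + sameContent(lastCrawl, edited));
    }
}
```

The same check would apply to / versus /index.html: fetch both and compare
digests. One thing I'd worry about is that if the server-side includes
insert anything that varies per request (a timestamp, say), a raw-byte hash
like this would report a change on every fetch.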


-- 
http://www.linkedin.com/in/paultomblin
