I'm watching my server logs as I do a second crawl of the site I crawled yesterday, and the fetcher is getting HTTP response code 200 on every page. Since none of those pages have changed, ideally the fetcher should send the last retrieval time in an If-Modified-Since request header, and the server would then respond with a 304 (Not Modified) code and no body, so the page wouldn't have to be downloaded and reparsed. Wouldn't this be a major win in terms of bandwidth consumed? Certainly GoogleBot does it that way.
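
To make concrete what I mean by a conditional fetch, here is a rough sketch in plain Python. It is not Nutch code, and the function names (conditional_headers, fetch_if_modified) are made up for illustration. It assumes the crawler has stored the last retrieval time as a Unix timestamp.

```python
from email.utils import formatdate
import urllib.error
import urllib.request


def conditional_headers(last_fetch_epoch):
    """Build the request header for a conditional GET.

    last_fetch_epoch is assumed to be the previous retrieval time
    as a Unix timestamp; HTTP dates must be expressed in GMT.
    """
    return {"If-Modified-Since": formatdate(last_fetch_epoch, usegmt=True)}


def fetch_if_modified(url, last_fetch_epoch):
    """Return (status, body); body is None when the server answers 304."""
    request = urllib.request.Request(
        url, headers=conditional_headers(last_fetch_epoch)
    )
    try:
        with urllib.request.urlopen(request) as response:
            # 200: the page changed (or the server ignores the header).
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; treat it as "unchanged".
        if err.code == 304:
            return 304, None
        raise
```

With something like this, a recrawl of an unchanged page would cost only the request and response headers, and a None body would tell the crawler to skip parsing and keep what it already has.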
I'm doing the crawl using a slightly modified version of the script on the wiki: http://wiki.apache.org/nutch/Crawl

-- 
http://www.linkedin.com/in/paultomblin
