Why isn't fetcher sending the last fetch time when it does a GET?

Paul Tomblin Sat, 08 Aug 2009 08:49:28 -0700

I'm watching my server logs as I do a second crawl of the site I
crawled yesterday, and the fetcher is getting HTTP response code 200
on every page.  Since none of those pages have changed, ideally the
fetcher should send the last retrieval time in an If-Modified-Since
request header, and the server would then respond with a 304 (Not
Modified) code and no body, so the fetcher wouldn't have to download
and reparse the same page.  Wouldn't this be a major win in terms of
bandwidth consumed?  Certainly GoogleBot does it that way.
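
To make the exchange I have in mind concrete, here is a minimal sketch
in Python of a conditional GET.  It is not Nutch's code: the function
names conditional_headers and server_status are hypothetical, and the
server side is reduced to a single timestamp comparison, assuming the
server tracks a last-modified time for each page.

```python
from datetime import datetime, timezone
from email.utils import formatdate, parsedate_to_datetime


def conditional_headers(last_fetch_epoch):
    """Build request headers for a conditional GET.

    last_fetch_epoch is the previous retrieval time in Unix seconds
    (hypothetical input; a crawler would read it from its own records).
    """
    # HTTP dates are always expressed in GMT, e.g.
    # "Thu, 01 Jan 1970 00:00:00 GMT".
    return {"If-Modified-Since": formatdate(last_fetch_epoch, usegmt=True)}


def server_status(last_modified_epoch, request_headers):
    """Return the status a server honouring If-Modified-Since would send.

    304 means "not modified, no body follows"; 200 means the full page
    is sent again.
    """
    since_header = request_headers.get("If-Modified-Since")
    if since_header is None:
        # Unconditional GET: the server always sends the full page.
        return 200
    since = parsedate_to_datetime(since_header)
    modified = datetime.fromtimestamp(int(last_modified_epoch), tz=timezone.utc)
    return 304 if modified <= since else 200


if __name__ == "__main__":
    headers = conditional_headers(1000)
    print(headers)
    print(server_status(500, headers))   # page unchanged since last fetch
    print(server_status(2000, headers))  # page changed after last fetch
```

With a header like this on each request, an unchanged page would cost
only the response headers instead of the whole body.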


I'm doing the crawl using a slightly modified version of the script on the Wiki
http://wiki.apache.org/nutch/Crawl


-- 
http://www.linkedin.com/in/paultomblin
