Hi,
I have a problem with the If-Modified-Since option in Nutch.
I want to crawl a web site every day, so I have set the property
db.fetch.interval.default in nutch-site.xml accordingly.
But I want to limit Nutch to fetching only pages that have changed, using the
If-Modified-Since header.
I found some resources on the web for doing this, but when I recrawl the pages
after the fetch interval, Nutch downloads all of them again. I use Nutch 1.0 with
the http protocol. I don't use the adaptive scheduler. In HttpResponse.java I
added the code:
if (datum.getModifiedTime() > 0) {
  // Prefer the modified time stored in the datum.
  String httpDate = HttpDateFormat.toString(datum.getModifiedTime());
  Http.LOG.debug("modified time: " + httpDate);
  reqStr.append("If-Modified-Since: " + httpDate);
  reqStr.append("\r\n");
} else if (datum.getFetchTime() > 0) {
  // Otherwise fall back to the fetch time stored in the datum.
  String httpDate = HttpDateFormat.toString(datum.getFetchTime());
  Http.LOG.debug("fetch time: " + httpDate);
  reqStr.append("If-Modified-Since: " + httpDate);
  reqStr.append("\r\n");
}
// Blank line that ends the request headers.
reqStr.append("\r\n");
because there was a bug that prevented the use of If-Modified-Since.
I also changed Fetcher.java so that the CrawlDb holds the correct value of
LastModified.
I tried crawling other web sites to find out whether the problem is
that my web server does not support If-Modified-Since. But in
every test I always get response code 200, even if the last-modified time
of the web page is older than the LastModified in the CrawlDb.
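To check a server independently of Nutch, a small standalone test along the following lines might help. This is only a sketch: it uses the standard java.net.HttpURLConnection rather than Nutch's own HTTP code, and the class name IfModifiedSinceCheck and its methods toHttpDate and conditionalGet are made up for illustration. A server that honors the header should answer 304 when the date sent is not older than the page's last modification.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
class IfModifiedSinceCheck {

    // Fixed-length HTTP date, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    // Formats a time in milliseconds since the epoch as an HTTP date.
    static String toHttpDate(long epochMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(epochMillis));
    }

    // Sends a GET with If-Modified-Since and returns the status code.
    // 304 means "not modified"; 200 means the server sent the page again.
    static int conditionalGet(String url, long sinceMillis) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("If-Modified-Since", toHttpDate(sinceMillis));
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: IfModifiedSinceCheck <url> [epochMillis]");
            return;
        }
        // Default to "now", so an unchanged page should yield 304.
        long since = args.length > 1 ? Long.parseLong(args[1])
                                     : System.currentTimeMillis();
        System.out.println("If-Modified-Since: " + toHttpDate(since));
        System.out.println("status: " + conditionalGet(args[0], since));
    }
}
```

If this returns 304 for a page while Nutch still gets 200 for the same page, the difference is more likely in the request Nutch sends (for example the date value or header formatting) than in the server.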
Can anyone tell me how to use If-Modified-Since correctly?
Thanks,
Cavalaglio Davide