Fetcher does not save a page's Last-Modified value in CrawlDatum
----------------------------------------------------------------
Key: NUTCH-933
URL: https://issues.apache.org/jira/browse/NUTCH-933
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.2
Reporter: Joe Kemp
I added the following code in the output method, just after the
if (content != null) statement.
    String lastModified = metadata.get("Last-Modified");
    if (lastModified != null && !lastModified.equals("")) {
        try {
            Date lastModifiedDate = DateUtil.parseDate(lastModified);
            datum.setModifiedTime(lastModifiedDate.getTime());
        } catch (DateParseException e) {
            // Unparseable header: leave the datum's modified time unchanged.
        }
    }
With this change, a recrawl now returns HTTP 304 (Not Modified) for pages
that haven't changed. This needs further testing. A configuration parameter
may also be needed to turn this behavior off, so that pages can be forced to
refresh.
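
The parsing step and the proposed on/off switch could be sketched roughly as
follows. This is only an illustration under stated assumptions: the class
name LastModifiedSketch, the method modifiedTime, and the storeLastModified
flag are hypothetical and not part of Nutch, and java.time is used in place
of the DateUtil call from the snippet above so the example is self-contained.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.OptionalLong;

// Hypothetical helper, not Nutch code: shows the parse-and-guard logic only.
public class LastModifiedSketch {

    /**
     * Converts a Last-Modified header value (RFC 1123 format) to epoch
     * milliseconds. Returns an empty result when the feature is switched
     * off, or when the header is missing, blank, or unparseable.
     *
     * storeLastModified is an assumed stand-in for a configuration
     * parameter; no such Nutch property is implied.
     */
    public static OptionalLong modifiedTime(String lastModified,
                                            boolean storeLastModified) {
        if (!storeLastModified
                || lastModified == null
                || lastModified.trim().isEmpty()) {
            return OptionalLong.empty();
        }
        try {
            ZonedDateTime parsed = ZonedDateTime.parse(
                    lastModified.trim(),
                    DateTimeFormatter.RFC_1123_DATE_TIME);
            return OptionalLong.of(parsed.toInstant().toEpochMilli());
        } catch (DateTimeParseException e) {
            // Unparseable header: report "no value" instead of failing.
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        OptionalLong t = modifiedTime("Wed, 21 Oct 2015 07:28:00 GMT", true);
        // A caller would pass this value to datum.setModifiedTime(...)
        // only when it is present.
        System.out.println(t.isPresent() ? t.getAsLong() : "no value");
    }
}
```

Returning an empty result when the flag is off leaves the stored modified
time untouched, which is one way to force a full refetch on the next crawl.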