[
https://issues.apache.org/jira/browse/NUTCH-933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930588#action_12930588
]
Sebastian Nagel commented on NUTCH-933:
---------------------------------------
The modifiedTime stored in a CrawlDatum record is not the "Last-Modified" time
sent by the responding server (or the time stamp of the file, if
protocol-file is used) but the time the document was fetched.
Is there a reason for this?
Determining the "Last-Modified" time is somewhat difficult because it may be
specified in the HTTP header or in the HTML as <META HTTP-EQUIV="Last-Modified"
...>. Still, it would be nice to have this information. In addition, the
index-more indexing filter, which provides a "lastModified" field, does not do
the job very well: it should take the value from the content metadata (which
seems to be mostly correct) and not from the parse metadata.
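To make the lookup order concrete, here is a minimal standalone sketch, not Nutch code. It assumes the header value (content metadata) is preferred over the META value (parse metadata), that both use the RFC 1123 date format, and that the fetch time is the fallback. The class and method names are hypothetical.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.OptionalLong;

public class LastModifiedSketch {

    /** Parses an RFC 1123 date such as "Sun, 06 Nov 1994 08:49:37 GMT" to epoch millis. */
    public static OptionalLong parseHttpDate(String value) {
        if (value == null || value.trim().isEmpty()) {
            return OptionalLong.empty();
        }
        try {
            long millis = ZonedDateTime
                .parse(value.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant()
                .toEpochMilli();
            return OptionalLong.of(millis);
        } catch (DateTimeParseException e) {
            // Unparseable value: report "no usable date" to the caller.
            return OptionalLong.empty();
        }
    }

    /** Assumed order: HTTP header first, then the META value, then the fetch time. */
    public static long resolveModifiedTime(String headerValue, String metaValue, long fetchTime) {
        OptionalLong fromHeader = parseHttpDate(headerValue);
        if (fromHeader.isPresent()) {
            return fromHeader.getAsLong();
        }
        OptionalLong fromMeta = parseHttpDate(metaValue);
        if (fromMeta.isPresent()) {
            return fromMeta.getAsLong();
        }
        return fetchTime;
    }
}
```

Real servers also send older date formats, which this sketch does not try to handle.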
As an aside: re-crawling with if-modified-since is not affected. Sending the
time of the last fetch makes no difference, because the document must be
re-fetched only if it has been modified since the last fetch.
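The reasoning can be checked with a small model. This is a simplified sketch that assumes the server compares timestamps by ordering (not by exact match) and that the server and crawler clocks agree. The class name, method name, and time values are made up for illustration.

```java
public class ConditionalGetSketch {

    /** Simplified server rule: 200 if modified after the If-Modified-Since time, else 304. */
    public static int statusFor(long resourceModifiedAt, long ifModifiedSince) {
        return resourceModifiedAt > ifModifiedSince ? 200 : 304;
    }

    public static void main(String[] args) {
        long lastModified = 1_000L; // server-side modification time at the first fetch
        long fetchTime = 5_000L;    // time of the first fetch, later than lastModified

        // Document unchanged: both candidate values lead to 304.
        System.out.println(statusFor(lastModified, lastModified));
        System.out.println(statusFor(lastModified, fetchTime));

        // Document changed after the fetch: both candidate values lead to 200.
        long changedAt = 9_000L;
        System.out.println(statusFor(changedAt, lastModified));
        System.out.println(statusFor(changedAt, fetchTime));
    }
}
```

If a server only answers 304 on an exact match, or if its clock differs from the crawler's, the two values can behave differently.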
> Fetcher does not save a pages Last-Modified value in CrawlDatum
> ---------------------------------------------------------------
>
> Key: NUTCH-933
> URL: https://issues.apache.org/jira/browse/NUTCH-933
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.2
> Reporter: Joe Kemp
>
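> I added the following code in the output method, just after the if (content
> != null) statement.
>   // DateUtil and DateParseException are presumably the commons-httpclient
>   // classes from org.apache.commons.httpclient.util.
>   String lastModified = metadata.get("Last-Modified");
>   if (lastModified != null && !lastModified.equals("")) {
>     try {
>       Date lastModifiedDate = DateUtil.parseDate(lastModified);
>       datum.setModifiedTime(lastModifiedDate.getTime());
>     } catch (DateParseException e) {
>       // Unparseable header: keep the modified time already set on the datum.
>     }
>   }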
> I now get 304 for pages that haven't changed when I recrawl. Need to do
> further testing. Might also need a configuration parameter to turn off this
> behavior, allowing pages to be forced to be refreshed.
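One way such a switch might look is sketched below. The property name "fetcher.store.last.modified" is hypothetical and is not an existing Nutch setting, and java.util.Properties stands in for the real configuration object so that the example is self-contained.

```java
import java.util.Properties;

public class ModifiedTimePolicySketch {

    // Hypothetical property name, chosen only for this illustration.
    public static final String STORE_LAST_MODIFIED = "fetcher.store.last.modified";

    /**
     * Returns the server's Last-Modified time when the switch is on (assumed default),
     * otherwise the fetch time, which keeps the old behavior.
     */
    public static long modifiedTimeToStore(Properties conf, long serverLastModified, long fetchTime) {
        boolean useServerValue =
            Boolean.parseBoolean(conf.getProperty(STORE_LAST_MODIFIED, "true"));
        return useServerValue ? serverLastModified : fetchTime;
    }
}
```

Setting the property to false would restore the previous behavior without a code change.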