[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-1475:
---------------------------------
Affects Version/s: 1.5.1
    (was: nutchgora)
This is an issue for the 1.x branch as well.
> Nutch 2.1 Index-More Plugin -- A better fall back value for date field
> ----------------------------------------------------------------------
>
> Key: NUTCH-1475
> URL: https://issues.apache.org/jira/browse/NUTCH-1475
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 2.1, 1.5.1
> Environment: All
> Reporter: James Sullivan
> Priority: Minor
> Labels: index-more, plugins
> Attachments: index-more-2x.patch
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> Among other fields, the index-more plugin for Nutch 2.x provides a "last
> modified" and a "date" field for the Solr index. The "last modified" field
> holds the last modified date from the HTTP headers if available; otherwise it
> is left empty. Currently, the "date" field is the same as the "last modified"
> field unless that field is empty, in which case getFetchTime is used as a
> fallback. I think getFetchTime is not a good fallback, as it is the next fetch
> time and often a month or more in the future, which doesn't make sense for the
> date field. Users do not expect web pages or documents with future dates. A
> more sensible fallback would be the current date at the time of indexing.
> This is possible by simply changing line 97 of
> https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
> from
>     time = page.getFetchTime(); // use fetch time
> to
>     time = new Date().getTime();
> Users interested in the getFetchTime value can still get it from the "tstamp"
> field.
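
For illustration only, the proposed fallback can be sketched in isolation from Nutch. The class name DateFallback, the method resolveDate, and the -1 sentinel for "no last modified value" are assumptions made for this example and are not taken from MoreIndexingFilter; only the new Date().getTime() call comes from the description above.

```java
import java.util.Date;

// Standalone sketch of the fallback described in the issue.
// Names and the sentinel value are hypothetical, not Nutch API.
class DateFallback {

    // Assumed marker for "no last modified date could be read from the headers".
    static final long UNKNOWN = -1L;

    // Returns the last modified time when it is known; otherwise falls back
    // to the supplied indexing time instead of the (future) next fetch time.
    public static long resolveDate(long lastModified, long indexTime) {
        return lastModified != UNKNOWN ? lastModified : indexTime;
    }

    public static void main(String[] args) {
        // Same call as in the proposed one-line change.
        long now = new Date().getTime();

        // No last modified header: the "date" field gets the indexing time.
        System.out.println(new Date(resolveDate(UNKNOWN, now)));

        // Last modified header present: it is used unchanged.
        System.out.println(new Date(resolveDate(0L, now)));
    }
}
```

Passing the indexing time in as a parameter, rather than reading the clock inside resolveDate, is a choice made here so the fallback can be checked with fixed values.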