[
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel reopened NUTCH-1475:
------------------------------------
Assignee: Sebastian Nagel
Thanks, [~lewismc]!
However, I've overlooked a small but important detail: the fetch datum (and not
the current CrawlDatum from CrawlDb) is passed to IndexingFilter plugins, cf. the
conversation on the user list
[[1|http://mail-archives.apache.org/mod_mbox/nutch-user/201306.mbox/%[email protected]%3E]]
(thanks, liaoks!).
Since the fetch datum contains the time at which the fetch took place, we should
use that as the last fallback value (and not the current time). Using the
lastModified time from CrawlDatum (if set) is not wrong and is closer to 2.x.
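
As a rough illustration of the fallback order described above, here is a minimal,
self-contained sketch. It is not the actual MoreIndexingFilter code: the class
name, the method name and all three parameters are hypothetical stand-ins, and
it assumes that an unset modified time is represented by a value <= 0.

```java
import java.util.OptionalLong;

/** Sketch only: hypothetical names, not part of the Nutch API. */
public class DateFallbackSketch {

    /**
     * Picks the value for the "date" field, in milliseconds since the epoch.
     *
     * @param lastModifiedHeader  parsed Last-Modified HTTP header, if any
     * @param datumModifiedTime   modified time stored in the datum
     *                            (assumption: <= 0 means "not set")
     * @param fetchDatumFetchTime time the fetch took place, taken from the
     *                            fetch datum
     */
    static long resolveDate(OptionalLong lastModifiedHeader,
                            long datumModifiedTime,
                            long fetchDatumFetchTime) {
        // 1. Prefer the Last-Modified header when the server sent one.
        if (lastModifiedHeader.isPresent()) {
            return lastModifiedHeader.getAsLong();
        }
        // 2. Otherwise use the modified time from the datum, if set.
        if (datumModifiedTime > 0) {
            return datumModifiedTime;
        }
        // 3. Last fallback: the time of the fetch, not the current time.
        return fetchDatumFetchTime;
    }

    public static void main(String[] args) {
        // Header present: the header value is used.
        System.out.println(resolveDate(OptionalLong.of(100L), 200L, 300L));
        // No header, modified time set: the modified time is used.
        System.out.println(resolveDate(OptionalLong.empty(), 200L, 300L));
        // Neither available: the fetch time is used.
        System.out.println(resolveDate(OptionalLong.empty(), 0L, 300L));
    }
}
```

With this order the "date" field can never lie in the future, since the fetch
time of the fetch datum is by definition in the past at indexing time.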
> Index-More Plugin -- A better fall back value for date field
> ------------------------------------------------------------
>
> Key: NUTCH-1475
> URL: https://issues.apache.org/jira/browse/NUTCH-1475
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 2.1, 1.5.1
> Environment: All
> Reporter: James Sullivan
> Assignee: Sebastian Nagel
> Priority: Minor
> Labels: index-more, plugins
> Fix For: 2.3, 1.8
>
> Attachments: index-more-1xand2x.patch, index-more-2x.patch,
> index-more-2x.patch, NUTCH-1475-trunk-v1.patch
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> Among other fields, the index-more plugin for Nutch 2.x provides a "last
> modified" and a "date" field for the Solr index. The "last modified" field
> holds the last modified date from the HTTP headers if available; otherwise it
> is left empty. Currently, the "date" field is the same as the "last modified"
> field unless that field is empty, in which case getFetchTime is used as a
> fallback. I think getFetchTime is not a good fallback, as it is the next fetch
> time and is often a month or more in the future, which doesn't make sense for
> the date field. Users do not expect webpages/documents with future dates. A
> more sensible fallback would be the current date at the time of indexing.
> This is possible by simply changing line 97 of
> https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
> from
> time = page.getFetchTime(); // use fetch time
> to
> time = new Date().getTime();
> Users interested in the getFetchTime value can still get it from the "tstamp"
> field.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira