Sebastian Nagel closed NUTCH-1475.

> Index-More Plugin -- A better fall back value for date field
> ------------------------------------------------------------
>                 Key: NUTCH-1475
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1475
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.1, 1.5.1
>         Environment: All
>            Reporter: James Sullivan
>            Assignee: Sebastian Nagel
>            Priority: Minor
>              Labels: index-more, plugins
>             Fix For: 1.7, 2.2.1
>         Attachments: NUTCH-1475-trunk-v1.patch, NUTCH-1475-trunk-v2.patch, 
> index-more-1xand2x.patch, index-more-2x.patch, index-more-2x.patch
>   Original Estimate: 1h
>  Remaining Estimate: 1h
> Among other fields, the more plugin for Nutch 2.x provides a "last modified" 
> and "date" field for the Solr index. The "last modified" field is the last 
> modified date from the HTTP headers if available; otherwise it is left 
> empty. Currently, the "date" field is the same as the "last modified" field 
> unless that field is empty, in which case getFetchTime is used as a fallback. 
> I think getFetchTime is not a good fallback because it is the next fetch time, 
> often a month or more in the future, which doesn't make sense for the date 
> field. Users do not expect webpages/documents with future dates. A more 
> sensible fallback would be the current date at the time of indexing. 
> This is possible by simply changing line 97 of 
> https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
>  from
> time = page.getFetchTime(); // use fetch time
> to
> time = new Date().getTime();
> Users interested in the getFetchTime value can still get it from the "tstamp" 
> field.
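
The fallback described above can be sketched in isolation roughly as follows. This is only an illustration under stated assumptions, not the code from MoreIndexingFilter or from the attached patches: the class DateFallbackSketch, the method resolveDate, and the convention that a missing Last-Modified value is passed as -1 are all hypothetical names and choices made for this example.

```java
import java.util.Date;

// Hypothetical stand-alone sketch; not part of Nutch.
final class DateFallbackSketch {

    // Assumption for this sketch: -1 marks a missing Last-Modified header.
    static final long MISSING = -1L;

    /**
     * Chooses the value for the "date" field.
     *
     * @param lastModifiedMillis Last-Modified time in epoch milliseconds,
     *                           or MISSING if the header was absent
     * @param indexTimeMillis    the time at which the document is indexed
     * @return lastModifiedMillis if present, otherwise indexTimeMillis
     */
    static long resolveDate(long lastModifiedMillis, long indexTimeMillis) {
        if (lastModifiedMillis != MISSING) {
            return lastModifiedMillis; // header present: use it as-is
        }
        // Header absent: fall back to the indexing time rather than the
        // (possibly future) next fetch time.
        return indexTimeMillis;
    }

    public static void main(String[] args) {
        long now = new Date().getTime(); // current time, as in the proposed change
        System.out.println(new Date(resolveDate(MISSING, now)));
        System.out.println(new Date(resolveDate(0L, now)));
    }
}
```

Passing the indexing time in as a parameter, instead of calling new Date() inside the method, is a choice made here only so the behaviour can be checked with fixed values.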

This message was sent by Atlassian Jira
