[ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1475:
----------------------------------

    Attachment: index-more-2x.patch

This patch uses getModifiedTime.
                
> Nutch 2.1 Index-More Plugin -- A better fall back value for date field
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-1475
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1475
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.1, 1.5.1
>         Environment: All
>            Reporter: James Sullivan
>            Priority: Minor
>              Labels: index-more, plugins
>             Fix For: 1.8
>
>         Attachments: index-more-1xand2x.patch, index-more-2x.patch, 
> index-more-2x.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Among other fields, the index-more plugin for Nutch 2.x provides "last modified"
> and "date" fields for the Solr index. The "last modified" field holds the
> Last-Modified date from the HTTP headers if available; otherwise it is left
> empty. Currently, the "date" field is the same as the "last modified" field
> unless that field is empty, in which case getFetchTime is used as a fallback.
> I think getFetchTime is not a good fallback because it is the next fetch time,
> often a month or more in the future, which doesn't make sense for the date
> field. Users do not expect webpages/documents with future dates. A more
> sensible fallback would be the current date at the time of indexing.
> This is possible by simply changing line 97 of 
> https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
>  from
> time = page.getFetchTime(); // use fetch time
> to
> time = new Date().getTime();
> Users interested in the getFetchTime value can still get it from the "tstamp" 
> field.
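
The fallback order proposed in the description can be sketched in isolation as follows. This is a minimal illustration, not the code in MoreIndexingFilter.java or in the attached patches (the comment above notes that the latest patch uses getModifiedTime instead). The class DateFallback, the method resolveDate, and the convention that a value <= 0 means "no Last-Modified header" are all assumptions made for the example.

```java
import java.util.Date;

// Hypothetical stand-in for the fallback logic; not the actual plugin code.
final class DateFallback {

    // Returns the value for the "date" field in milliseconds since the epoch.
    // lastModified: time parsed from the HTTP Last-Modified header,
    //               or a value <= 0 if the header was absent (assumption).
    // indexTime:    clock reading taken when the document is indexed; passed
    //               in as a parameter so the logic can be tested deterministically.
    static long resolveDate(long lastModified, long indexTime) {
        if (lastModified > 0) {
            return lastModified; // header present: "date" mirrors "last modified"
        }
        return indexTime;        // header absent: fall back to indexing time
    }

    public static void main(String[] args) {
        long now = new Date().getTime();
        // No Last-Modified header: the indexing time is used.
        System.out.println(new Date(resolveDate(0L, now)));
        // Last-Modified header present: its value is kept.
        System.out.println(new Date(resolveDate(1351728000000L, now)));
    }
}
```

With this ordering the "date" field can never lie in the future relative to indexing, while the next fetch time stays available separately through the "tstamp" field, as the description points out.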

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
