[ 
https://issues.apache.org/jira/browse/NUTCH-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1416.
----------------------------------

    Resolution: Won't Fix

There is no way to know which segment a NutchWritable (e.g. ParseText) comes 
from. One way around this issue is to index the segments one by one instead of 
using the whole segments dir.
Note that this should not be an issue in Nutch 2.x, as the WebTable contains 
only the latest version of a document.
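
The workaround of indexing segments one by one could be driven by something 
like the sketch below. It only works out the order in which to process the 
segments, oldest first, so that the newest copy of a page is indexed last. It 
assumes that segment directory names are timestamps (yyyyMMddHHmmss) that sort 
chronologically, and that the index backend replaces a document when the same 
URL is indexed again. The class name, the method name and the crawl/segments/ 
path are made up for illustration and are not part of Nutch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SegmentOrder {

    /**
     * Returns the segment names sorted oldest first.
     * Assumption: names are yyyyMMddHHmmss timestamps, so plain string
     * order is also chronological order.
     */
    static List<String> oldestFirst(List<String> segmentNames) {
        List<String> sorted = new ArrayList<>(segmentNames);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical segment names, deliberately out of order.
        List<String> segments = Arrays.asList(
                "20140301120000", "20140130093000", "20140215080000");

        // Each entry would be handed to a separate indexing run
        // instead of passing the whole segments dir at once.
        for (String segment : oldestFirst(segments)) {
            System.out.println("index segment: crawl/segments/" + segment);
        }
    }
}
```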

> IndexerMapReduce can index older version of a document instead of latest one
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-1416
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1416
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>            Reporter: Jianyun He
>            Priority: Critical
>
> When we update the index, there is no guarantee that the content being 
> indexed is the latest. The reduce() method of the class IndexerMapReduce 
> contains the following code:
> public void reduce(Text key, Iterator<NutchWritable> values,
>                    OutputCollector<Text, NutchDocument> output,
>                    Reporter reporter) throws IOException {
>    ……
>    } else if (value instanceof ParseData) {  
>       parseData = (ParseData)value;
>    } else if (value instanceof ParseText) { 
>       parseText = (ParseText)value;
>    }
>    ……
> }
> For example, 30 days ago I fetched web page A, and now I fetch it again. 
> The key A then corresponds to two ParseData objects (located in different 
> segments). But this code does not compare the fetch times and simply 
> overwrites the previous value, so the final value may be the old one.
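
The overwrite behaviour described in the quoted report can be reproduced in 
isolation with the sketch below. It is not Nutch code: Versioned is a 
hypothetical stand-in that carries a fetch time next to the parsed text, which 
is exactly the information the resolution says a NutchWritable such as 
ParseText does not expose. lastSeen() mirrors the plain assignment in the 
quoted reduce(), while latest() shows what a fetch-time comparison would look 
like if that information were available.

```java
import java.util.Arrays;
import java.util.List;

public class LatestValueDemo {

    /** Hypothetical value tagged with its fetch time; not a Nutch class. */
    static final class Versioned {
        final long fetchTime;
        final String text;

        Versioned(long fetchTime, String text) {
            this.fetchTime = fetchTime;
            this.text = text;
        }
    }

    /** Mirrors the quoted code: every value overwrites the previous one. */
    static Versioned lastSeen(List<Versioned> values) {
        Versioned result = null;
        for (Versioned value : values) {
            result = value;
        }
        return result;
    }

    /** Keeps the value with the greatest fetch time, whatever the order. */
    static Versioned latest(List<Versioned> values) {
        Versioned result = null;
        for (Versioned value : values) {
            if (result == null || value.fetchTime > result.fetchTime) {
                result = value;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The newer copy happens to arrive first, the 30-day-old copy second.
        List<Versioned> values = Arrays.asList(
                new Versioned(2000L, "new"),
                new Versioned(1000L, "old"));

        System.out.println(lastSeen(values).text); // prints "old"
        System.out.println(latest(values).text);   // prints "new"
    }
}
```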



--
This message was sent by Atlassian JIRA
(v6.2#6252)
