[ https://issues.apache.org/jira/browse/NUTCH-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-1416:
---------------------------------
Summary: IndexerMapReduce can index older version of a document instead of
latest one (was: Can not update the index)
> IndexerMapReduce can index older version of a document instead of latest one
> ----------------------------------------------------------------------------
>
> Key: NUTCH-1416
> URL: https://issues.apache.org/jira/browse/NUTCH-1416
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Reporter: Jianyun He
> Priority: Critical
>
> When we update the index, there is no guarantee that the content being indexed
> is the latest. In the class IndexerMapReduce, the reduce() method contains the
> following code:
> public void reduce(Text key, Iterator<NutchWritable> values,
>                    OutputCollector<Text, NutchDocument> output,
>                    Reporter reporter) throws IOException {
>   ……
>   } else if (value instanceof ParseData) {
>     parseData = (ParseData)value;
>   } else if (value instanceof ParseText) {
>     parseText = (ParseText)value;
>   }
>   ……
> }
> For example, I fetched web page A 30 days ago and have now fetched it again.
> The key A then corresponds to two ParseData objects (located in different
> segments). This code does not compare fetch times; each value simply
> overwrites the previous one, so the final value may be the old one.
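
The overwrite problem described above can be illustrated with a minimal, self-contained sketch. It does not use Nutch classes: `Versioned` and `pickLatest` are hypothetical names, and the sketch assumes each value can be paired with the fetch time of the segment it came from. How that time would actually be obtained inside IndexerMapReduce is not stated in the report and is left open here.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical illustration only; none of these types are part of Nutch.
final class LatestVersionSketch {

    /** A parsed value tagged with the fetch time of its segment (an assumption). */
    static final class Versioned {
        final long fetchTime;   // epoch milliseconds
        final String payload;   // stands in for ParseData / ParseText content

        Versioned(long fetchTime, String payload) {
            this.fetchTime = fetchTime;
            this.payload = payload;
        }
    }

    /** Returns the value with the greatest fetch time, or null if there are none. */
    static Versioned pickLatest(Iterator<Versioned> values) {
        Versioned latest = null;
        while (values.hasNext()) {
            Versioned candidate = values.next();
            // Replace only when strictly newer; on a tie the first value seen is kept.
            if (latest == null || candidate.fetchTime > latest.fetchTime) {
                latest = candidate;
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        Versioned old = new Versioned(1000L, "page A, fetched 30 days ago");
        Versioned fresh = new Versioned(2000L, "page A, fetched today");
        // The newer value arrives first here; a plain overwrite would end with the old one.
        Versioned chosen = pickLatest(Arrays.asList(fresh, old).iterator());
        System.out.println(chosen.payload);
    }
}
```

Because the comparison is made on the fetch time rather than on arrival order, the result no longer depends on the order in which the reducer receives values from different segments.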