[ 
https://issues.apache.org/jira/browse/NUTCH-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jianyun He updated NUTCH-1416:
------------------------------

    Description: 
When the index is updated, there is no guarantee that the indexed content is 
the latest. The reduce() method of the IndexerMapReduce class contains the 
following code:
public void reduce(Text key, Iterator<NutchWritable> values,
                     OutputCollector<Text, NutchDocument> output,
                     Reporter reporter) throws IOException {
   ……
   } else if (value instanceof ParseData) {
      parseData = (ParseData)value;
   } else if (value instanceof ParseText) {
      parseText = (ParseText)value;
   }
   ……
}
For example, I fetched web page A 30 days ago and now fetch it again. The key 
A then corresponds to two ParseData objects (located in different segments). 
The code does not compare fetch times; it simply overwrites the previous 
value, so the final value may be the old one.
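
One possible direction, shown only as a minimal sketch and not as the actual 
Nutch fix: keep a value only when its fetch time is newer than the one already 
held. The TimedParse class, its fetchTime field, and the latest() helper are 
hypothetical stand-ins invented for this illustration; how a fetch time would 
be associated with each ParseData or ParseText value in IndexerMapReduce is an 
assumption and is not shown.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class LatestParseSketch {

    /** Hypothetical stand-in for a parse record plus the fetch time of its segment. */
    static final class TimedParse {
        final String content;
        final long fetchTime; // assumed unit: milliseconds since the epoch

        TimedParse(String content, long fetchTime) {
            this.content = content;
            this.fetchTime = fetchTime;
        }
    }

    /** Returns the value with the greatest fetch time, or null if there are no values. */
    static TimedParse latest(Iterator<TimedParse> values) {
        TimedParse newest = null;
        while (values.hasNext()) {
            TimedParse value = values.next();
            // Replace the held value only if this one was fetched later,
            // so the result does not depend on iteration order.
            if (newest == null || value.fetchTime > newest.fetchTime) {
                newest = value;
            }
        }
        return newest;
    }

    public static void main(String[] args) {
        // The newer record is deliberately listed first: a plain overwrite
        // would end up holding the 30-day-old record.
        List<TimedParse> values = Arrays.asList(
            new TimedParse("page A, fetched today", 2_592_000_000L),
            new TimedParse("page A, fetched 30 days ago", 0L));
        System.out.println(latest(values.iterator()).content);
    }
}
```

With the comparison in place, the order in which the segments reach the 
reducer no longer decides which version of page A is kept.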

    
> Can not update the index
> ------------------------
>
>                 Key: NUTCH-1416
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1416
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>            Reporter: Jianyun He
