[jira] [Reopened] (NUTCH-1416) IndexerMapReduce can index older version of a document instead of latest one

Sebastian Nagel (JIRA) Thu, 25 Jun 2015 08:19:12 -0700

     [ 
https://issues.apache.org/jira/browse/NUTCH-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sebastian Nagel reopened NUTCH-1416:
------------------------------------
      Assignee: Sebastian Nagel

Re-opening: if you have to re-index a bunch of segments, this cannot be done 
within a single job. Instead, you have to:
* either run SegmentMerger beforehand
* or index segment by segment in the right order, with some overhead (CrawlDb 
and LinkDb are read again and again)

It should not be too hard to fix. We have to re-establish the correct ordering 
by segment name or fetch time for the 3 items coming from segments:
# fetch datum: can be sorted by fetch time, see NUTCH-1617
# ParseData: contains the segment name in its content metadata; use this to 
keep only the latest one
# ParseText: needs a way to associate it with the segment it stems from, e.g., 
wrap it into a MetaWrapper object as SegmentMerger does
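
The selection step could be sketched roughly as follows. This is a minimal, 
standalone illustration, not the actual patch: the class LatestSegmentItem and 
the wrapper Tagged are hypothetical names standing in for a MetaWrapper-like 
object, and the sketch assumes that segment names are timestamps of the form 
yyyyMMddHHmmss, so that comparing them as strings yields chronological order.

```java
import java.util.Arrays;
import java.util.List;

public class LatestSegmentItem {

  /** Hypothetical wrapper: a value plus the name of the segment it came from. */
  static final class Tagged<T> {
    final String segment;
    final T value;

    Tagged(String segment, T value) {
      this.segment = segment;
      this.value = value;
    }
  }

  /**
   * Returns the item stemming from the newest segment, or null if there is none.
   * Assumption: segment names sort chronologically when compared as strings.
   */
  static <T> Tagged<T> latest(Iterable<Tagged<T>> items) {
    Tagged<T> best = null;
    for (Tagged<T> item : items) {
      // Only replace the current candidate if this item is from a newer segment,
      // instead of blindly overwriting it with whatever arrives last.
      if (best == null || item.segment.compareTo(best.segment) > 0) {
        best = item;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // The newer item arrives first; a plain overwrite would keep the old one.
    List<Tagged<String>> values = Arrays.asList(
        new Tagged<String>("20150625081912", "new text"),
        new Tagged<String>("20150526081912", "old text"));
    System.out.println(latest(values).value);
  }
}
```

With a comparison like this, the result no longer depends on the order in which 
the values reach the reducer. The same check would have to be applied to each 
of the three item types separately.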

There should not be much overhead for the default case (only one segment is 
indexed): it only involves wrapping ParseText and a few null checks to see 
whether one of the items will be overwritten. Possibly, we can even optimize 
the one-segment case.

> IndexerMapReduce can index older version of a document instead of latest one
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-1416
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1416
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>            Reporter: Jianyun He
>            Assignee: Sebastian Nagel
>            Priority: Critical
>
> When we update the index, there is no guarantee that the indexed content is 
> the latest. In the class IndexerMapReduce, the method reduce() has the 
> following code:
> public void reduce(Text key, Iterator<NutchWritable> values,
>                      OutputCollector<Text, NutchDocument> output,
>                      Reporter reporter) throws IOException {
>    ……
>    } else if (value instanceof ParseData) {  
>       parseData = (ParseData)value;
>    } else if (value instanceof ParseText) { 
>       parseText = (ParseText)value;
>    }
>    ……
> }
> For example, I fetched web page A 30 days ago and now fetch it again. The 
> key A then corresponds to two ParseData objects (located in different 
> segments). But this code does not compare the fetch times and simply 
> overwrites the previous value, so the final value may be the old one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
