IndexMerger produces indexes itself cannot merge anymore
--------------------------------------------------------
Key: NUTCH-971
URL: https://issues.apache.org/jira/browse/NUTCH-971
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.2
Reporter: Gabriele Kahlout
Priority: Minor
Fix For: 1.3
Here's what I do:
1. Index the fetched segments:
$ rm -r $new_indexes $temp_indexes
$ bin/nutch index $new_indexes $it_crawldb crawl/linkdb crawl/segments/*
I examine the index with Luke and it is as expected.
2. Merge the new index with the previous one:
$ bin/nutch merge $temp_indexes $new_indexes $indexes
IndexMerger: starting at 2011-03-26 10:24:58
IndexMerger: merging indexes to: crawl/temp_indexes
Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
IndexMerger: finished at 2011-03-26 10:24:59, elapsed: 00:00:01
On the first iteration, when $indexes is empty, this works fine: it essentially
duplicates $new_indexes into $temp_indexes.
But on the 2nd iteration, after I mv $temp_indexes $indexes[1], the merged index
$temp_indexes contains only $new_indexes and nothing from $indexes, which
still holds the data from the previous iteration. That is, it doesn't
merge.
This unexpected merge behavior is NOT symmetric in its inputs: even with the
argument order swapped, only $new_indexes is added, i.e.
$ bin/nutch merge $temp_indexes $indexes $new_indexes
IndexMerger: starting at 2011-03-26 10:32:15
IndexMerger: merging indexes to: crawl/temp_indexes
Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
IndexMerger: finished at 2011-03-26 10:32:16, elapsed: 00:00:01
The moral of the story is that a merged index cannot be merged with another,
i.e. bin/nutch merge is meant to merge only 2 indexes generated with bin/nutch
index (or solrindex, perhaps).
The only difference I can tell between the 2 indexes is that the merged index
doesn't contain the file index_done (and a hidden companion), but adding those
to the merged index before merging it again doesn't solve the problem either.
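One way to see how the two directories differ on disk is to list their
top-level entries, hidden files included. A minimal sketch: index_layout is a
hypothetical helper name, not part of Nutch, and it only inspects the
filesystem.

```shell
# Hypothetical helper (not part of Nutch): print each top-level entry of an
# index directory, marked as "dir" or "file", including hidden entries.
# The body runs in a subshell so its variables do not leak into the caller.
index_layout() (
  index_dir=$1
  for entry in "$index_dir"/* "$index_dir"/.[!.]*; do
    # Skip glob patterns that matched nothing.
    [ -e "$entry" ] || continue
    if [ -d "$entry" ]; then
      echo "dir  ${entry##*/}"
    else
      echo "file ${entry##*/}"
    fi
  done
)
```

Running it on both directories, e.g. index_layout "$new_indexes" and then
index_layout "$indexes", puts the two layouts side by side.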
The workaround that makes the merged index equivalent to one generated by
bin/nutch index seems to be putting it in a "part" subdirectory:
$ bin/nutch merge crawl/temp_indexes/part-1 crawl/indexes crawl/new_indexes
IndexMerger: starting at 2011-03-26 11:18:10
IndexMerger: merging indexes to: crawl/temp_indexes/part-1
Adding file:/Users/simpatico/nutch-1.2/crawl/indexes/part-1
Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
IndexMerger: finished at 2011-03-26 11:18:12, elapsed: 00:00:01
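For a merged index that was already written flat, the same layout can be
produced after the fact. The sketch below assumes, based only on the
"Adding ..." lines in the logs above, that IndexMerger picks up nothing but
the subdirectories of each input directory. wrap_flat_index is a hypothetical
helper name, not part of Nutch, and the default name part-00000 is just
copied from the bin/nutch index output shown earlier.

```shell
# Hypothetical helper (not part of Nutch): move the top-level files of a flat
# index directory into a "part" subdirectory, so the directory has the same
# shape as the output of bin/nutch index.
# ASSUMPTION: IndexMerger only adds subdirectories of its input directories.
# The body runs in a subshell, so "exit" only leaves the function.
wrap_flat_index() (
  index_dir=$1
  part_name=${2:-part-00000}

  [ -d "$index_dir" ] || exit 1

  # If any subdirectory already exists, leave the index untouched.
  for entry in "$index_dir"/*/; do
    [ -d "$entry" ] && exit 0
  done

  mkdir "$index_dir/$part_name" || exit 1

  # Move every entry, hidden ones included, into the new subdirectory.
  for entry in "$index_dir"/* "$index_dir"/.[!.]*; do
    [ -e "$entry" ] || continue
    [ "$entry" = "$index_dir/$part_name" ] && continue
    mv "$entry" "$index_dir/$part_name/" || exit 1
  done
)
```

With this, wrap_flat_index crawl/indexes part-1 would turn a flat
crawl/indexes into crawl/indexes/part-1, which is the layout the log above
shows being added. Calling it on a directory that already has a subdirectory
changes nothing.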
Where is this documented? Rather than documenting it, I'd recommend having
IndexMerger handle it, as in the attached patch.