Hi Michael,

Thanks for your answer. I think you read the source code correctly: the tool removes low-frequency pages from the index. If you want fewer pages, I think it is simpler to fetch fewer top pages in the first place (generate segments with -TopN).

We don't use the indexmerge tool, because it makes balancing segments between backends more work. Once we have fetched the number of pages we want, we run dedup, then prune, then OptimizeIndex. Every segment has its own index. We count the pages actually indexed in each index and balance the segments between the backends. If one backend has a higher CPU load than the others, we move a segment from it to another backend. In that case there is no need to reindex or to run the indexmerge tool again.
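
The balancing step described above could look roughly like the following. This is a minimal sketch under stated assumptions, not Ferenc's actual procedure: the class `SegmentBalancer`, the segment and backend names, and the greedy "largest segment onto the least-loaded backend" strategy are all hypothetical, and the page counts stand in for the real number of indexed pages per segment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SegmentBalancer {
    /**
     * Assigns each segment to the backend that currently holds the fewest
     * indexed pages, taking the largest segments first (greedy heuristic).
     */
    static Map<String, List<String>> assign(Map<String, Integer> pagesPerSegment,
                                            List<String> backends) {
        if (backends.isEmpty()) {
            throw new IllegalArgumentException("at least one backend is required");
        }
        List<Map.Entry<String, Integer>> segments =
                new ArrayList<>(pagesPerSegment.entrySet());
        // Largest segment first; ties are broken by name so the plan is deterministic.
        segments.sort((a, b) -> {
            int byPages = Integer.compare(b.getValue(), a.getValue());
            return byPages != 0 ? byPages : a.getKey().compareTo(b.getKey());
        });

        Map<String, List<String>> plan = new LinkedHashMap<>();
        Map<String, Long> load = new LinkedHashMap<>();
        for (String backend : backends) {
            plan.put(backend, new ArrayList<>());
            load.put(backend, 0L);
        }
        for (Map.Entry<String, Integer> segment : segments) {
            // Pick the backend with the smallest page total so far.
            String target = backends.get(0);
            for (String backend : backends) {
                if (load.get(backend) < load.get(target)) {
                    target = backend;
                }
            }
            plan.get(target).add(segment.getKey());
            load.put(target, load.get(target) + segment.getValue());
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, Integer> pages = new LinkedHashMap<>();
        pages.put("segment-a", 500);   // hypothetical indexed-page counts
        pages.put("segment-b", 300);
        pages.put("segment-c", 200);
        System.out.println(assign(pages, List.of("backend-1", "backend-2")));
    }
}
```

Because every segment keeps its own index, moving a segment between backends only changes this plan; nothing has to be re-indexed or re-merged.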

Regards,
   Ferenc


Michael Nebel wrote:

Hi Ferenc,

as far as I understand, your tool removes all deleted pages ("nutch
prune", "nutch dedup") from an index and builds a new (smaller) one. In
our workflow we run "nutch prune" on the segment indexes and then do a
"nutch merge", so the deleted pages do not occur in our main index.
In our scenario, your tool only helps us to tune the segment indexes;
for our main index it seems to be of nearly no use. But if the workflow
is changed to merge first and delete afterwards, OptimizeIndex becomes a
"must do". We have simply been lucky to avoid the problems so far.

The IndexOptimizer uses a different approach. If I read the code right,
it takes all terms with an idf below a certain threshold and reduces
their entries. So the total number of documents for a search changes.
With the default configuration only about 10% of the terms stay in the
index. So the answer to the query "http" gets (much) smaller.
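
The thresholding described above can be sketched roughly as follows. This is an illustration only, not the IndexOptimizer source: the class `PruneDecision`, the threshold value 3.0, the collection size of one million documents, and the choice of rounding up are assumptions. The idf formula is the classic Lucene default, ln(numDocs / (docFreq + 1)) + 1; the 10% fraction and the figure 226957 are taken from this thread.

```java
final class PruneDecision {
    static final double IDF_THRESHOLD = 3.0; // hypothetical cut-off, not the tool's default
    static final double FRACTION = 0.1;      // "about 10%" as mentioned in the thread

    /** Classic Lucene-style inverse document frequency. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Very common terms have a low idf and are candidates for pruning. */
    static boolean shouldPrune(int docFreq, int numDocs) {
        return idf(docFreq, numDocs) < IDF_THRESHOLD;
    }

    /** Number of entries to keep for a pruned term (rounding up is an assumption). */
    static int keepCount(int docFreq) {
        return (int) Math.ceil(docFreq * FRACTION);
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000; // hypothetical index size
        int docFreq = 226957;    // document frequency reported for url:http
        if (shouldPrune(docFreq, numDocs)) {
            System.out.println("Optimizing from " + docFreq + " to " + keepCount(docFreq));
        }
    }
}
```

Under these assumptions a rare term keeps all of its entries, while a term as common as "http" keeps only a fraction, which is why the result count for such queries drops.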

What I still do not know: yes, a smaller index makes the system much faster, but at what price? Which numbers make sense?

Regards

    Michael



[EMAIL PROTECTED] wrote:

Dear Michael,

I wrote a tool, OptimizeIndex.java. It is faster, and there is no question about what it does. After you optimize the index with IndexOptimizer, is the number of results for a search on 'http' still the same?

Regards,
   Ferenc

Michael Nebel wrote:

Hi,

I fixed the problem with the following patch:

--- IndexOptimizer.java 2005-08-04 12:55:54.000000000 +0200
+++ IndexOptimizer.java.~1.6.~  2005-01-21 00:48:50.000000000 +0100
@@ -138,7 +138,7 @@

         if (score > minScore) {
           sdq.put(new ScoreDoc(doc, score));
-          if (sdq.size() >= count) {               // if sdq overfull
+          if (sdq.size() > count) {               // if sdq overfull
             sdq.pop();                             // remove lowest in sdq
             minScore = ((ScoreDoc)sdq.top()).score; // reset minScore
           }
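
Why a single comparison operator decides between a crash and a clean run can be shown with a small stand-alone sketch. It does not use Lucene: `BoundedMinHeap` and `TopPostingsDemo` are hypothetical, simplified stand-ins for a queue that has room for exactly `count` entries. Judging by the file dates in the diff headers, the patch appears to have been generated with the two files in reverse order, so the `>=` line is taken here to be the fixed version. That reading is an inference from the exception reported at the start of this thread (index 22697 for a target of 22696), not something stated in the thread.

```java
/** Fixed-capacity min-heap; slot 0 is unused, valid slots are 1..maxSize. */
final class BoundedMinHeap {
    private final float[] heap;
    private int size = 0;

    BoundedMinHeap(int maxSize) { heap = new float[maxSize + 1]; }

    int size() { return size; }

    float top() { return heap[1]; }

    /** Adds a value; throws ArrayIndexOutOfBoundsException if the heap is already full. */
    void put(float value) {
        heap[size + 1] = value;
        size++;
        for (int i = size; i > 1 && heap[i] < heap[i / 2]; i /= 2) {
            swap(i, i / 2);   // sift the new value up
        }
    }

    /** Removes and returns the smallest value. */
    float pop() {
        float min = heap[1];
        heap[1] = heap[size];
        size--;
        int i = 1;
        while (2 * i <= size) {
            int child = 2 * i;
            if (child + 1 <= size && heap[child + 1] < heap[child]) child++;
            if (heap[i] <= heap[child]) break;
            swap(i, child);   // sift the moved value down
            i = child;
        }
        return min;
    }

    private void swap(int a, int b) {
        float tmp = heap[a];
        heap[a] = heap[b];
        heap[b] = tmp;
    }
}

final class TopPostingsDemo {
    /** Mirrors the loop in the patch; popWhenFull selects ">=" (true) or ">" (false). */
    static int keep(float[] scores, int count, boolean popWhenFull) {
        BoundedMinHeap sdq = new BoundedMinHeap(count);
        float minScore = 0.0f;
        for (float score : scores) {
            if (score > minScore) {
                sdq.put(score);
                boolean overfull = popWhenFull ? sdq.size() >= count : sdq.size() > count;
                if (overfull) {
                    sdq.pop();              // remove lowest in sdq
                    minScore = sdq.top();   // reset minScore
                }
            }
        }
        return sdq.size();
    }

    public static void main(String[] args) {
        float[] scores = {1f, 2f, 3f, 4f, 5f};
        System.out.println("with >= : kept " + keep(scores, 3, true));
        try {
            keep(scores, 3, false);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("with >  : " + e);
        }
    }
}
```

With `>` the queue is allowed to fill all `count` slots without popping, so the next `put` writes to slot `count + 1` and fails. With `>=` an entry is popped as soon as the queue is full, so the queue settles at `count - 1` entries after each step and never overflows.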

My index shrank from 8.5 GB to 0.5 GB. I found no documentation about the background of this tool. Can anyone tell me what the idea behind it is?

Regards

    Michael



Andy Liu wrote:

I believe this tool is unfinished and unsupported.

On 7/22/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

I found an IndexOptimizer in Nutch.
When I run it, it throws an exception:
....
Optimizing url:http from 226957 to 22696
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 22697
       at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:46)
       at org.apache.nutch.indexer.IndexOptimizer$OptimizingTermPositions.seek(IndexOptimizer.java:153)
       at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
       at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
       at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
       at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
       at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
       at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:578)
       at org.apache.nutch.indexer.IndexOptimizer.optimize(IndexOptimizer.java:215)
       at org.apache.nutch.indexer.IndexOptimizer.main(IndexOptimizer.java:235)








_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
