Hi Ferenc,

As far as I understand, your tool removes all deleted pages ("nutch
prune", "nutch dedup") from an index and builds a new (smaller) one. In
our workflow we run "nutch prune" on the segment indexes and then do a
"nutch merge", so the deleted pages do not occur in our main index.
In our scenario, your tool only helps us tune the segment indexes;
for our main index it seems to be of almost no use. But if we changed
the workflow to merge first and delete afterwards, OptimizeIndex would be a
"must do". We have just been lucky to avoid the problems.

The IndexOptimizer uses a different approach. If I read the code right,
it takes all terms with an idf below a certain threshold and reduces their
entries, so the total number of documents found for a search changes. With the
default configuration only about 10% of those entries stay in the index, so
the result for the query "http" gets (much) smaller.
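
To make that reading concrete, here is a minimal sketch of the idea, not the IndexOptimizer code itself. All names (PostingPruneSketch, Posting, prune) and the parameters fraction and minDocFreq are made up for illustration. A term whose document frequency reaches minDocFreq (that is, a term with a low idf) keeps only the best-scoring fraction of its entries; rarer terms are left alone.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration of posting-list pruning; not taken from Nutch. */
final class PostingPruneSketch {

    /** One entry of a term's posting list: a document id and its score. */
    static final class Posting {
        final int doc;
        final float score;
        Posting(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /** Number of entries kept for a term; fraction = 0.1 mirrors the "about 10%". */
    static int target(int docFreq, double fraction) {
        return Math.max(1, (int) Math.round(docFreq * fraction));
    }

    /**
     * Returns the sorted ids of the documents that stay in the posting list.
     * Terms with fewer than minDocFreq entries (high idf) are not pruned.
     */
    static List<Integer> prune(List<Posting> postings, double fraction, int minDocFreq) {
        List<Integer> docs = new ArrayList<>();
        if (postings.size() < minDocFreq) {
            for (Posting p : postings) docs.add(p.doc);
        } else {
            int keep = target(postings.size(), fraction);
            // Min-heap on score: the head is the weakest entry kept so far.
            PriorityQueue<Posting> best =
                new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
            for (Posting p : postings) {
                best.add(p);
                if (best.size() > keep) best.poll(); // drop the weakest entry
            }
            for (Posting p : best) docs.add(p.doc);
        }
        Collections.sort(docs);
        return docs;
    }

    public static void main(String[] args) {
        List<Posting> postings = new ArrayList<>();
        for (int doc = 0; doc < 20; doc++) postings.add(new Posting(doc, doc));
        System.out.println(prune(postings, 0.1, 10));
    }
}
```

With fraction = 0.1 every frequent term loses about nine out of ten entries, which is why a query on such a term returns far fewer hits after optimizing.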

What I still do not know: yes, a smaller index makes the system much faster, but at what price? Which numbers make sense?

Regards

        Michael



[EMAIL PROTECTED] wrote:

Dear Michael,

I wrote a tool, OptimizeIndex.java. It is faster, and it leaves no question about what it does. After you optimize the index with IndexOptimizer, is the number of hits for a search on 'http' still the same?

Regards,
   Ferenc

Michael Nebel wrote:

Hi,

I fixed the problem with the following patch:

--- IndexOptimizer.java 2005-08-04 12:55:54.000000000 +0200
+++ IndexOptimizer.java.~1.6.~  2005-01-21 00:48:50.000000000 +0100
@@ -138,7 +138,7 @@

         if (score > minScore) {
           sdq.put(new ScoreDoc(doc, score));
-          if (sdq.size() >= count) {               // if sdq overfull
+          if (sdq.size() > count) {               // if sdq overfull
             sdq.pop();                             // remove lowest in sdq
             minScore = ((ScoreDoc)sdq.top()).score; // reset minScore
           }
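
Why one character matters here can be sketched with a small fixed-capacity heap. This is a simplified stand-in, not Lucene's PriorityQueue source; the class BoundedQueueSketch and the method collect are hypothetical names. The assumption is that the queue is created with room for exactly count entries and that put() does no bounds check, which would fit the reported exception index 22697 being one more than the target of 22696. Note also that in the patch as posted the `>=` line sits on the `-` side, whose file carries the newer date, so the diff appears to run from the patched file to the original.

```java
/** Simplified fixed-capacity min-heap; an assumption, not Lucene's code. */
final class BoundedQueueSketch {
    private final float[] heap; // slot 0 is unused, so the capacity is maxSize
    private int size = 0;

    BoundedQueueSketch(int maxSize) { heap = new float[maxSize + 1]; }

    int size() { return size; }

    float top() { return heap[1]; } // smallest score; only valid while size > 0

    /** Adds a score; throws ArrayIndexOutOfBoundsException if already full. */
    void put(float score) {
        size++;
        heap[size] = score;
        for (int i = size; i > 1 && heap[i] < heap[i / 2]; i /= 2) swap(i, i / 2);
    }

    /** Removes and returns the smallest score. */
    float pop() {
        float min = heap[1];
        heap[1] = heap[size];
        size--;
        int i = 1;
        while (2 * i <= size) {
            int child = 2 * i;
            if (child + 1 <= size && heap[child + 1] < heap[child]) child++;
            if (heap[i] <= heap[child]) break;
            swap(i, child);
            i = child;
        }
        return min;
    }

    private void swap(int a, int b) { float t = heap[a]; heap[a] = heap[b]; heap[b] = t; }

    /**
     * Mirrors the loop in the patch; count is assumed to be at least 2.
     * popWhenFull = true uses ">=", false uses ">".
     */
    static int collect(float[] scores, int count, boolean popWhenFull) {
        BoundedQueueSketch sdq = new BoundedQueueSketch(count);
        float minScore = 0.0f;
        for (float score : scores) {
            if (score > minScore) {
                sdq.put(score);
                boolean overfull = popWhenFull ? sdq.size() >= count : sdq.size() > count;
                if (overfull) {
                    sdq.pop();              // remove lowest in sdq
                    minScore = sdq.top();   // reset minScore
                }
            }
        }
        return sdq.size();
    }

    public static void main(String[] args) {
        float[] scores = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f, 9f, 10f};
        System.out.println("with >= : " + collect(scores, 3, true) + " entries kept");
        try {
            collect(scores, 3, false);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("with >  : " + e.getMessage());
        }
    }
}
```

With `>`, the queue is allowed to fill up completely, and the next put() writes one slot past the end of the array. With `>=`, the lowest entry is popped as soon as the queue is full, so put() always finds a free slot, at the cost of keeping count - 1 entries instead of count.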

My index shrank from 8.5 GB to 0.5 GB. I found no documentation about the background of this tool. Can anyone tell me what the idea behind it is?

Regards

    Michael



Andy Liu wrote:

I believe this tool is unfinished and unsupported.

On 7/22/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

I found an IndexOptimizer in Nutch.
When I run it, it throws an exception:
....
Optimizing url:http from 226957 to 22696
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 22697
       at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:46)
       at org.apache.nutch.indexer.IndexOptimizer$OptimizingTermPositions.seek(IndexOptimizer.java:153)
       at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
       at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
       at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
       at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
       at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
       at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:578)
       at org.apache.nutch.indexer.IndexOptimizer.optimize(IndexOptimizer.java:215)
       at org.apache.nutch.indexer.IndexOptimizer.main(IndexOptimizer.java:235)





--
Michael Nebel                   Augustenburger Str. 1, 22769 Hamburg
                                Telefon:   040 / 851 581 45
http://www.nebel.de/            Mobil:     0172 / 41 53 256
http://www.netluchs.de/         E-Mail:    [EMAIL PROTECTED]



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
