Hi Ferenc,
as far as I understand, your tool removes all deleted pages ("nutch
prune", "nutch dedup") from an index and builds a new (smaller) one. In
our workflow we run "nutch prune" on the segment indexes and then do a
"nutch merge", so the deleted pages do not occur in our main index.
In our scenario, your tool only helps us to tune the segment indexes;
for our main index it seems to be of nearly no use. But if we changed
the workflow to merge first and delete afterwards, OptimizeIndex would
be a "must do". We have only been lucky to avoid the problems.
The IndexOptimizer uses a different approach. If I read the code right,
it takes all terms with an idf under a certain threshold and reduces
their entries. So the total number of documents found for a search
changes. With the default configuration only about 10% of the entries
for such a term stay in the index, so the answer to the query "http"
gets (much) smaller.
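
To make the pruning idea concrete, here is a minimal sketch in plain Java. It is not the Nutch code: the class PostingPruner, both method names and the rounding rule are assumptions made for illustration. The fraction 0.1 stands for the "about 10%" default mentioned above, and rounding was chosen only because it reproduces the numbers in the log line "Optimizing url:http from 226957 to 22696".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative only: a stand-in for the pruning idea, not the Nutch IndexOptimizer. */
final class PostingPruner {

    /** How many entries of a term survive (assumed rule: round docFreq * fraction). */
    static int keepCount(int docFreq, double fraction) {
        return (int) Math.round(docFreq * fraction);
    }

    /**
     * Returns the `count` highest scores, best first. A min-heap holds the
     * best scores seen so far; its head is the weakest of them.
     */
    static List<Float> topScores(float[] scores, int count) {
        PriorityQueue<Float> heap = new PriorityQueue<>();
        for (float s : scores) {
            if (heap.size() < count) {
                heap.add(s);                      // still room: always keep
            } else if (count > 0 && s > heap.peek()) {
                heap.poll();                      // drop the current weakest
                heap.add(s);                      // keep the better score
            }
        }
        List<Float> kept = new ArrayList<>(heap);
        kept.sort(Collections.reverseOrder());
        return kept;
    }

    public static void main(String[] args) {
        // Entry count for the term from the log line quoted in this thread.
        System.out.println(keepCount(226957, 0.1));
        float[] scores = {0.2f, 0.9f, 0.1f, 0.5f, 0.7f, 0.3f, 0.8f, 0.4f, 0.6f, 0.05f};
        System.out.println(topScores(scores, keepCount(scores.length, 0.1)));
    }
}
```

Under these assumptions a query on a pruned term can only ever return the kept entries, which is why the hit count for "http" drops.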
What I still do not know: yes, a smaller index makes the system much
faster. But at what price? Which numbers make sense?
Regards
Michael
[EMAIL PROTECTED] wrote:
Dear Michael,
I wrote a tool, OptimizeIndex.java. It is faster, and there is no
question about what it does.
After you optimize the index with IndexOptimizer, is the number of
results when searching for 'http' still the same?
Regards,
Ferenc
Michael Nebel wrote:
Hi,
I fixed the problem with the following patch:
--- IndexOptimizer.java 2005-08-04 12:55:54.000000000 +0200
+++ IndexOptimizer.java.~1.6.~ 2005-01-21 00:48:50.000000000 +0100
@@ -138,7 +138,7 @@
         if (score > minScore) {
           sdq.put(new ScoreDoc(doc, score));
-          if (sdq.size() >= count) {                // if sdq overfull
+          if (sdq.size() > count) {                 // if sdq overfull
             sdq.pop();                              // remove lowest in sdq
             minScore = ((ScoreDoc)sdq.top()).score; // reset minScore
           }
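
Why the comparison matters can be shown with a small stand-alone sketch. It assumes the queue is backed by a fixed array with room for exactly count elements, which is what the ArrayIndexOutOfBoundsException in the trace further down suggests. FixedQueue and BoundedQueueDemo are hypothetical stand-ins, not Lucene's PriorityQueue. Note also that the hunk as posted has the 2005-08-04 file on the "---" side, so the ">=" line appears to be the patched one.

```java
/** Illustrative stand-in for the loop in the patch; not Lucene or Nutch code. */
final class BoundedQueueDemo {

    /** Minimal fixed-capacity queue using slots 1..maxSize (slot 0 unused). */
    static final class FixedQueue {
        private final float[] slots;
        private int size = 0;

        FixedQueue(int maxSize) { slots = new float[maxSize + 1]; }

        int size() { return size; }

        void put(float v) {
            size++;
            slots[size] = v; // ArrayIndexOutOfBoundsException once size > maxSize
        }

        /** Index of the smallest stored value (linear scan keeps the sketch short). */
        private int minIndex() {
            int min = 1;
            for (int i = 2; i <= size; i++) {
                if (slots[i] < slots[min]) min = i;
            }
            return min;
        }

        /** Smallest stored value; only valid while size > 0. */
        float top() { return slots[minIndex()]; }

        /** Removes and returns the smallest stored value. */
        float pop() {
            int min = minIndex();
            float v = slots[min];
            slots[min] = slots[size];
            size--;
            return v;
        }
    }

    /** Runs the loop from the patch; strict selects ">" instead of ">=". */
    static int collect(float[] scores, int count, boolean strict) {
        FixedQueue sdq = new FixedQueue(count);
        float minScore = 0.0f;
        for (float score : scores) {
            if (score > minScore) {
                sdq.put(score);
                boolean overfull = strict ? sdq.size() > count : sdq.size() >= count;
                if (overfull) {
                    sdq.pop();              // remove lowest
                    minScore = sdq.top();   // reset minScore
                }
            }
        }
        return sdq.size();
    }

    public static void main(String[] args) {
        float[] scores = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f};
        System.out.println(collect(scores, 3, false));
        try {
            collect(scores, 3, true);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("overflow with the strict comparison");
        }
    }
}
```

In this sketch the ">" variant lets the queue fill all count slots and then overflows on the next put, while the ">=" variant pops as soon as the queue is full and so never holds more than count - 1 entries. Giving the queue one extra slot would be another way to avoid the overflow; which of the two the tool intended is not clear from the thread.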
My index shrank from 8.5 GB to 0.5 GB. I found no documentation
about the background of this tool. Can anyone tell me what the idea
behind it is?
Regards
Michael
Andy Liu wrote:
I believe this tool is unfinished and unsupported.
On 7/22/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
I found an IndexOptimizer in Nutch.
When I run it, it throws an exception:
....
Optimizing url:http from 226957 to 22696
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 22697
    at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:46)
    at org.apache.nutch.indexer.IndexOptimizer$OptimizingTermPositions.seek(IndexOptimizer.java:153)
    at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
    at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:578)
    at org.apache.nutch.indexer.IndexOptimizer.optimize(IndexOptimizer.java:215)
    at org.apache.nutch.indexer.IndexOptimizer.main(IndexOptimizer.java:235)
--
Michael Nebel Augustenburger Str. 1, 22769 Hamburg
Telefon: 040 / 851 581 45
http://www.nebel.de/ Mobil: 0172 / 41 53 256
http://www.netluchs.de/ E-Mail: [EMAIL PROTECTED]