Re: IndexOptimizer (Re: Lucene performance bottlenecks)

Andrzej Bialecki Tue, 13 Dec 2005 06:43:23 -0800

Doug Cutting wrote:

Andrzej Bialecki wrote:
Shouldn't this be combined with a HitCollector that collects only the first-n matches? Otherwise we still need to scan the whole posting list...
Yes.  I was just posting the work-in-progress.
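
A minimal sketch of the first-n idea, kept free of Lucene dependencies so it runs on its own. The collect(int doc, float score) method mirrors the shape of HitCollector.collect; the names FirstNCollector, EnoughHitsException and scan are hypothetical, and aborting the scan by throwing an unchecked exception is an assumption of this sketch, not something confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical signal used to abort a scan once enough hits are collected. */
class EnoughHitsException extends RuntimeException {
    EnoughHitsException() {
        super("first-n limit reached");
    }
}

/** Collects at most `limit` hits, in the order the scan delivers them. */
class FirstNCollector {
    private final int limit;
    private final List<Integer> docs = new ArrayList<>();
    private int lastDoc = -1; // highest doc id processed so far

    FirstNCollector(int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        this.limit = limit;
    }

    /** Same shape as HitCollector.collect(int, float); the score is ignored here. */
    void collect(int doc, float score) {
        docs.add(doc);
        lastDoc = doc;
        if (docs.size() >= limit) {
            throw new EnoughHitsException(); // stop the scan early
        }
    }

    List<Integer> docs() {
        return docs;
    }

    int lastDoc() {
        return lastDoc;
    }

    /** Stand-in for the searcher: feeds matching doc ids in increasing order. */
    static FirstNCollector scan(int[] matchingDocs, int limit) {
        FirstNCollector collector = new FirstNCollector(limit);
        try {
            for (int doc : matchingDocs) {
                collector.collect(doc, 1.0f);
            }
        } catch (EnoughHitsException stop) {
            // expected: the collector has its first n matches
        }
        return collector;
    }
}
```

On a sorted index the first n doc ids are the ones that come first in the sort order, so stopping early should not drop the top documents. The collector also remembers the last doc id it saw, which is what a total-count estimate would need.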

Ok, I just tested IndexSorter for now. It appears to work correctly; at least I get exactly the same results, with the same scores and the same explanations, if I run the same queries on the original and on the sorted index. For now, the query response time is identical as far as I can tell.

We will also need to estimate the total number of matches by extrapolating linearly from the maximum doc id processed.



...which should be reported by the custom HitCollector, right?
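
The linear extrapolation mentioned here could look roughly like this. It is only a sketch: the class and method names are hypothetical, and it assumes the collector reports how many matches it saw and the highest doc id it processed, and that matches are spread evenly over the doc id range.

```java
/** Hypothetical helper: extrapolates a total hit count from a partial scan. */
class HitCountEstimator {

    /**
     * @param collected        matches seen before the scan stopped
     * @param lastDocProcessed highest doc id processed, or -1 if none
     * @param maxDoc           number of documents in the index
     */
    static long estimateTotal(int collected, int lastDocProcessed, int maxDoc) {
        if (collected <= 0 || lastDocProcessed < 0) {
            return 0; // nothing seen, nothing to extrapolate from
        }
        long scanned = (long) lastDocProcessed + 1; // doc ids 0..lastDocProcessed
        if (scanned >= maxDoc) {
            return collected; // whole index was covered, count is exact
        }
        // matches per scanned doc, scaled up to the full index
        return Math.round((double) collected * maxDoc / scanned);
    }
}
```

For example, 100 matches within the first 10,000 doc ids of a 1,000,000-document index extrapolate to an estimate of 10,000 matches.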

Finally, it is probably rather slow for large indexes, whose .fdt won't fit in memory. A simple way to improve that might be to use Similarity.floatToByte-encoded floats when sorting (e.g., the norm from an untokenized field) so that

Yes, for an index of 5 million docs the IndexOptimizer takes ~10 min. to complete; this IndexSorter took over 1 hour...
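
One possible reading of the byte-encoded suggestion: if each document's sort key is a single byte, the keys for the whole index fit in one byte[] and the ordering can be computed with a counting sort, without touching stored fields. The sketch below assumes the keys are already encoded, that a larger unsigned byte means a larger value, and that documents should come out in descending key order. ByteKeySorter is a hypothetical name, and nothing here calls Lucene.

```java
/** Hypothetical helper: orders doc ids by a one-byte sort key using counting sort. */
class ByteKeySorter {

    /**
     * @param keys keys[doc] is the byte-encoded sort key of document `doc`
     * @return old doc ids in descending unsigned key order; ties keep doc id order
     */
    static int[] sortByKeyDescending(byte[] keys) {
        // rank 0 is the largest key (255), rank 255 is the smallest key (0)
        int[] starts = new int[257];
        for (byte key : keys) {
            int rank = 255 - (key & 0xFF);
            starts[rank + 1]++;
        }
        // prefix sums: starts[rank] becomes the first output slot for that rank
        for (int rank = 0; rank < 256; rank++) {
            starts[rank + 1] += starts[rank];
        }
        int[] order = new int[keys.length];
        for (int doc = 0; doc < keys.length; doc++) {
            int rank = 255 - (keys[doc] & 0xFF);
            order[starts[rank]++] = doc; // stable within a rank
        }
        return order;
    }
}
```

This needs one byte per document plus a fixed 257-entry table, so memory use grows with the number of documents rather than with the size of the stored fields.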


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
