Doug Cutting wrote:

Andrzej Bialecki wrote:

Shouldn't this be combined with a HitCollector that collects only the first-n matches? Otherwise we still need to scan the whole posting list...


Yes.  I was just posting the work-in-progress.
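
A minimal sketch of the first-n idea, assuming matches arrive in increasing doc id order. FirstNCollector and LimitReached are hypothetical names, not Lucene classes; the collect(doc, score) shape only mirrors HitCollector. The scan is aborted by throwing an exception, which is an assumption about how the caller would be stopped:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical collector (not part of Lucene): keeps only the first n matches. */
final class FirstNCollector {
    /** Thrown to abort the scan once the limit is reached. */
    static final class LimitReached extends RuntimeException {}

    private final int limit;
    private final List<Integer> docs = new ArrayList<>();

    FirstNCollector(int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        this.limit = limit;
    }

    /** Called once per matching document, in increasing doc id order. */
    void collect(int doc, float score) {
        docs.add(doc);
        if (docs.size() >= limit) {
            // Stop here so the rest of the posting list is never scanned.
            throw new LimitReached();
        }
    }

    /** Doc ids collected so far, in the order they were seen. */
    List<Integer> docs() {
        return docs;
    }
}
```

The caller would wrap the search in a try/catch for LimitReached and then read docs().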


Ok, I just tested IndexSorter for now. It appears to work correctly: I get exactly the same results, with the same scores and the same explanations, when I run the same queries on the original and on the sorted index. For now, the query response time is identical as far as I can tell.

We will also need to estimate the total number of matches by extrapolating linearly from the maximum doc id processed.


...which should be reported by the custom HitCollector, right?
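
The linear extrapolation could look roughly like the sketch below. MatchEstimator and its parameter names are hypothetical, and it assumes that match density over the unscanned doc ids is about the same as over the scanned prefix, which may not hold for every query:

```java
/** Hypothetical helper (not part of Lucene) for estimating the total hit count. */
final class MatchEstimator {
    /**
     * @param collected  number of matches collected before stopping
     * @param maxDocSeen largest doc id processed when the scan stopped
     * @param indexSize  total number of documents in the index
     */
    static long estimateTotal(int collected, int maxDocSeen, int indexSize) {
        if (collected == 0) {
            return 0;
        }
        // Doc ids 0..maxDocSeen were covered, i.e. maxDocSeen + 1 documents.
        long scanned = (long) maxDocSeen + 1;
        // Scale the observed count up to the whole index.
        return Math.round((double) collected * indexSize / scanned);
    }
}
```

For example, 1000 matches collected with the last processed doc id at 49999, in an index of 5,000,000 docs, extrapolates to 100,000 total matches.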

Finally, it is probably rather slow for large indexes, whose .fdt won't fit in memory. A simple way to improve that might be to use Similarity.floatToByte-encoded floats when sorting (e.g., the norm from an untokenized field) so that


Yes. For an index of 5 million docs the IndexOptimizer takes about 10 minutes to complete, while this IndexSorter took over 1 hour...

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

