Doug Cutting wrote:
Andrzej Bialecki wrote:
Shouldn't this be combined with a HitCollector that collects only the
first-n matches? Otherwise we still need to scan the whole posting
list...
Yes. I was just posting the work-in-progress.
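
To make the first-n idea concrete, here is a minimal sketch. It is not Lucene's HitCollector API; FirstNCollector, LimitReached and search are hypothetical names. It assumes that postings are visited in increasing doc id order and that the sorted index puts the most important documents first, so the scan can stop as soon as n hits have been seen:

```python
class LimitReached(Exception):
    """Raised by the collector to stop the scan early."""


class FirstNCollector:
    """Hypothetical collector that keeps only the first n matching docs."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self.hits = []      # (doc_id, score) pairs, in doc id order
        self.last_doc = -1  # highest doc id processed so far

    def collect(self, doc_id, score):
        self.hits.append((doc_id, score))
        self.last_doc = doc_id
        if len(self.hits) >= self.n:
            # Signal the caller that the rest of the posting list can be skipped.
            raise LimitReached()


def search(postings, collector):
    """Feed (doc_id, score) pairs to the collector until it asks to stop."""
    try:
        for doc_id, score in postings:
            collector.collect(doc_id, score)
    except LimitReached:
        pass
    return collector.hits
```

The collector also remembers the last doc id it saw, which is the value needed later for estimating the total number of matches.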
Ok, I have only tested IndexSorter so far. It appears to work correctly:
I get exactly the same results, with the same scores and the same
explanations, when I run the same queries on the original and on the
sorted index. So far the query response time is identical, as far as I
can tell.
We will also need to estimate the total number of matches by
extrapolating linearly from the maximum doc id processed.
...which should be reported by the custom HitCollector, right?
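
A rough sketch of that linear extrapolation, assuming matches are spread evenly over the doc id space (which a sorted index may not guarantee). The function name and parameters are hypothetical; last_doc_id is the highest doc id the collector processed and max_doc is the number of documents in the index:

```python
def estimate_total_hits(hits_collected, last_doc_id, max_doc, exhausted=False):
    """Extrapolate the total match count from a partial scan.

    If the posting list was fully scanned (exhausted=True), or nothing was
    collected, the collected count is already the answer.
    """
    if exhausted or hits_collected == 0 or last_doc_id < 0:
        return hits_collected
    docs_scanned = last_doc_id + 1  # doc ids start at 0
    # hits / docs_scanned is the observed match density; scale it to max_doc.
    return hits_collected * max_doc // docs_scanned
```

For example, 100 hits collected by doc id 9999 in an index of 1,000,000 documents gives an estimate of 10,000 matches.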
Finally, it is probably rather slow for large indexes, whose .fdt
won't fit in memory. A simple way to improve that might be to use
Similarity.floatToByte-encoded floats when sorting (e.g., the norm
from an untokenized field) so that
Yes, for an index of 5 million docs the IndexOptimizer takes about 10
minutes to complete, while this IndexSorter took over 1 hour...
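
One possible reading of the byte-encoded suggestion, as a sketch only: if each document's sort value is already available as a single byte (for example a norm), the keys for the whole index fit in memory and the new doc order can be computed with a counting sort. The function name is hypothetical, the encoding itself is not reproduced here, and the sketch assumes that a larger byte value means a higher sort value:

```python
def sort_docs_by_byte_key(byte_keys):
    """Return old doc ids ordered by descending one-byte key.

    byte_keys: bytes or bytearray with one entry per document, indexed by
    doc id. Documents with equal keys keep their original relative order.
    """
    # Count how many documents fall into each of the 256 possible keys.
    counts = [0] * 256
    for key in byte_keys:
        counts[key] += 1

    # Compute where each bucket starts in the output, highest key first.
    starts = [0] * 256
    offset = 0
    for key in range(255, -1, -1):
        starts[key] = offset
        offset += counts[key]

    # Place every doc id into its bucket in a single pass.
    order = [0] * len(byte_keys)
    for doc_id, key in enumerate(byte_keys):
        order[starts[key]] = doc_id
        starts[key] += 1
    return order
```

This needs one byte per document for the keys plus the output array, and it runs in time linear in the number of documents.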
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com