AMIRAULT Martin created LUCENE-7482:
---------------------------------------
Summary: Faster sorted index search for reverse order search
Key: LUCENE-7482
URL: https://issues.apache.org/jira/browse/LUCENE-7482
Project: Lucene - Core
Issue Type: New Feature
Reporter: AMIRAULT Martin
Priority: Minor
We currently use Lucene at my company for our main product.
Our search functionality is quite basic and the results are always sorted
on a predefined field. The user can only choose the sort order
(Asc/Desc).
I am currently investigating using the index sort feature with
EarlyTerminationSortingCollector.
It is a shame that searching a sorted index in reverse order does not benefit
from any optimization. I was wondering if it would be possible to make it
faster by creating a special "ReverseSortingCollector" for this purpose.
I am aware that a posting list is designed to always be iterated in the same
order, so this is not about early-terminating the search but about
filtering out unneeded documents more efficiently.
If a segment is sorted in the reverse of the search order, we can easily work
out the docId from which documents should be collected.
Here is a quick code sample:
{quote}
public class ReverseSortingCollector extends FilterCollector {
  /** Sort used to sort the search results */
  protected final Sort sort;
  /** Number of documents to collect in each segment */
  protected final int numDocsToCollect;
  [...]
  @Override
  public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    LeafReader reader = context.reader();
    Sort segmentSort = reader.getIndexSort();
    // The segment is sorted in the reverse order of the search sort
    if (isReverseOrder(sort, segmentSort)) {
      // Here we can easily work out the docId from which we should collect
      final long collectFrom = reader.numDocs() - numDocsToCollect;
      return new FilterLeafCollector(in.getLeafCollector(context)) {
        @Override
        public void collect(int doc) throws IOException {
          // Only delegate documents at or after the cut-off
          if (doc >= collectFrom) {
            super.collect(doc);
          }
        }
      };
    } else {
      // No known ordering relationship: collect every document as usual
      return in.getLeafCollector(context);
    }
  }
}
{quote}
This is especially efficient when used along with TopFieldCollector, as a lot
of doc-value lookups no longer take place.
In my experiment it reduced search time by 90%.
However, I was wondering whether it is correct, as my knowledge of Lucene is
still quite limited.
In particular, is it correct to assume that the docIds of a LeafReader always
span from 0 to LeafReader.numDocs()?
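To make the question concrete, here is a minimal stand-alone sketch of the
cut-off arithmetic. It uses no Lucene classes: the boolean array stands in for
a segment's live documents, and the names CutoffSketch, collectFrom and
collected are hypothetical. It assumes every live document matches the query.
As far as I understand, docIds in a segment run from 0 to maxDoc() - 1 while
numDocs() counts only non-deleted documents, so the two bounds differ as soon
as a segment has deletions.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone sketch of the cut-off arithmetic; no Lucene classes are used. */
public class CutoffSketch {

    /** First docId to collect, given an upper bound on docIds and the number of hits wanted. */
    static int collectFrom(int docIdUpperBound, int numDocsToCollect) {
        return Math.max(0, docIdUpperBound - numDocsToCollect);
    }

    /** Returns the live docIds that are >= cutoff, in increasing docId order. */
    static List<Integer> collected(boolean[] live, int cutoff) {
        List<Integer> out = new ArrayList<>();
        for (int doc = cutoff; doc < live.length; doc++) {
            if (live[doc]) {
                out.add(doc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical segment: 10 docId slots, docIds 7 and 9 are deleted.
        boolean[] live = {true, true, true, true, true, true, true, false, true, false};
        int maxDoc = live.length;   // stands in for LeafReader.maxDoc()
        int numDocs = 0;            // stands in for LeafReader.numDocs()
        for (boolean b : live) {
            if (b) numDocs++;
        }
        int wanted = 3;

        // Cut-off derived from the number of live documents (as in the sample above)
        System.out.println(collected(live, collectFrom(numDocs, wanted)));
        // Cut-off derived from the number of docId slots
        System.out.println(collected(live, collectFrom(maxDoc, wanted)));
    }
}
```

In this toy segment the numDocs-based cut-off lets through at least the wanted
number of live documents (possibly more), whereas the maxDoc-based cut-off can
let through fewer when deleted documents sit at the end of the segment.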
Note: this does not support paging. Paging could eventually be implemented by
providing a way to look up the docId to start from, based on the last document
collected (e.g. for a LongPoint, by querying the docId closest to the
previously returned value...)
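One possible way to do that lookup is a binary search over the per-document
sort values, sketched below. This is only an illustration under assumptions:
the long[] array stands in for the sort values of a segment sorted in
descending order (in Lucene they would have to be read from the index), the
search sort is ascending, values are unique, and every document matches the
query. PagingCutoffSketch and its methods are hypothetical names.

```java
/** Stand-alone sketch of a paging cut-off lookup; no Lucene classes are used. */
public class PagingCutoffSketch {

    /**
     * Values are sorted in descending order by docId. Returns the first docId
     * whose value is <= lastValue, or descValues.length if there is none.
     */
    static int firstDocAtOrBelow(long[] descValues, long lastValue) {
        int lo = 0;
        int hi = descValues.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (descValues[mid] <= lastValue) {
                hi = mid;       // mid already belongs to an earlier page
            } else {
                lo = mid + 1;   // mid has not been returned yet
            }
        }
        return lo;
    }

    /** Returns {from, to}: the half-open docId window holding the next page. */
    static int[] nextPageWindow(long[] descValues, long lastValue, int pageSize) {
        int to = firstDocAtOrBelow(descValues, lastValue);
        int from = Math.max(0, to - pageSize);
        return new int[] {from, to};
    }

    public static void main(String[] args) {
        // Hypothetical segment sorted descending; the first ascending page of
        // size 2 returned the values 10 and 20 (docIds 5 and 4).
        long[] values = {90, 70, 50, 40, 20, 10};
        int[] window = nextPageWindow(values, 20L, 2);
        System.out.println(window[0] + ".." + window[1]);
    }
}
```

The collector would then delegate only docIds inside that window. Duplicate
sort values would need a tie-breaker (for example the docId of the last hit),
which this sketch leaves out.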