Hi Lucene users, I'm looking into ways to improve filtered and sorted searches over our untokenized fields. For instance, filter by fields A and B (the filters could be term, prefix, or range) and sort by the value of A.
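
To make the scenario concrete, here is roughly the kind of search I mean (Lucene 4.x APIs; the field names "A" and "B" and the values are of course made up):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.*;

  // given: IndexSearcher searcher
  // Filter by a term on B and a prefix on A, and sort by A's value.
  BooleanQuery query = new BooleanQuery();
  query.add(new TermQuery(new Term("B", "someValue")), BooleanClause.Occur.MUST);
  query.add(new PrefixQuery(new Term("A", "abc")), BooleanClause.Occur.MUST);
  Sort sortByA = new Sort(new SortField("A", SortField.Type.STRING));
  TopDocs page = searcher.search(query, 10, sortByA); // top page of 10 hits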
Lucene scores and collects documents in DocId order. Normally DocIds are assigned incrementally as documents are added to the index, so they are effectively random with respect to the doc value ordinals of any particular field. Hence, when a query matches more documents in a segment than the page size, Lucene still has to score and collect *all* matching documents and sort at the end to decide which of them constitute the top page. When the number of matching documents is far greater than the page size, this has a real performance impact.

When we use SortingMergePolicy, merged segments have DocIds in the same order as the sort field(s) of the merge policy. If a query's sort spec is consistent with the segment's sort order, documents are scored and collected in sort-value order, so collection can terminate early once the page size is reached (see the collector sketch in the P.S. below). This yields significant performance gains. However, index-time sorting is limited to a single sort spec, and we need to sort efficiently by more than one.

I want to understand what it would take to allow early termination with query-time sorting but without index-time sorting, or what issues would prevent us from doing so. At a high level it appears theoretically possible to me with the following set of changes. I want to check with the community for suggestions, whether I am overlooking some problems this would face, or whether it seems viable:

1. We would need to be able to get the DocIds for sorted doc values. Lucene already exposes a FieldCache that is enumerable by sorted doc values: it provides DocId->DocValue and DocId->ordinal mappings, but not the other way around. We would need to change FieldCache to also keep a reverse ordinal->DocId map and expose it via a new API. This makes it possible to get the DocId(s) associated with the current DocValue/ordinal during enumeration (a sketch is in the P.S.).

2. We would have to change query scorers to be aware of the desired DocId order (which would come from the field cache of the sort property) so they can score and collect in that order. Currently all DocIdSetIterators only allow advancing forward in DocId. We would have to change that:

   a. TermScorer uses the postings reader's BlockDocsEnum as its DocIdSetIterator. The DocIds for each term are kept in blocks, each block keeping DocIds in order, and multi-level skip lists make it efficient to skip to a target DocId. These skip lists maintain only forward pointers, so we would have to write a codec (a postings writer) that also keeps multi-level skip lists with *backwards* pointers, and a postings reader whose BlockDocsEnum can seek to DocIds both backwards and forwards using them.

   b. For scorers that directly use field cache filters, the underlying DocIdSetIterator is a FixedBitSet, which only supports moving forward using words and bit operations. We would need to change that to go in either direction; this would still be efficient given that FixedBitSet is an in-memory bitmap (a sketch is in the P.S.).

3. Once the building blocks above are in place, we would need to change queries and scorers to remove any assumption that DocIds strictly move forward. For instance, FilteredQuery uses a leapfrog strategy that strictly assumes DocIds only move forward, and ConjunctionScorer makes the same assumption in the way it evaluates and advances each individual conjunct.

I am relatively new to the Lucene world and would appreciate thoughts and feedback on this.

Thank you,
-Ayse
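
P.S. To make the early-termination mechanics concrete: a collector can stop collection for the current segment by throwing CollectionTerminatedException, which IndexSearcher catches before moving on to the next segment. This is roughly what the misc module's EarlyTerminatingSortingCollector does for SortingMergePolicy-sorted segments. A stripped-down sketch against the 4.x Collector API:

  import java.io.IOException;
  import org.apache.lucene.index.AtomicReaderContext;
  import org.apache.lucene.search.*;

  // Wraps the real collector (e.g. a TopFieldCollector) and stops each
  // segment after numHits docs. Only correct when the segment's DocId
  // order matches the query's sort order.
  class EarlyTerminatingCollector extends Collector {
    private final Collector in;
    private final int numHits;
    private int collected;

    EarlyTerminatingCollector(Collector in, int numHits) {
      this.in = in;
      this.numHits = numHits;
    }

    @Override public void setScorer(Scorer scorer) throws IOException {
      in.setScorer(scorer);
    }

    @Override public void collect(int doc) throws IOException {
      in.collect(doc);
      if (++collected >= numHits) {
        throw new CollectionTerminatedException(); // caught by IndexSearcher
      }
    }

    @Override public void setNextReader(AtomicReaderContext context) throws IOException {
      collected = 0; // per-segment budget
      in.setNextReader(context);
    }

    @Override public boolean acceptsDocsOutOfOrder() {
      return false; // we rely on in-order collection
    }
  }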
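
For point 1, the reverse ordinal->DocId direction could be built in one pass over a segment's SortedDocValues; since an ordinal can map to several documents, it is really an ordinal -> DocId-list map. A sketch (the class and method names are made up):

  import java.io.IOException;
  import org.apache.lucene.index.AtomicReader;
  import org.apache.lucene.index.SortedDocValues;

  class OrdToDocMap {
    // Counting sort: group the segment's DocIds by sort-field ordinal.
    static int[][] build(AtomicReader reader, String field) throws IOException {
      SortedDocValues dv = reader.getSortedDocValues(field); // assumes the field has sorted doc values
      int[] counts = new int[dv.getValueCount()];
      for (int doc = 0; doc < reader.maxDoc(); doc++) {
        int ord = dv.getOrd(doc);
        if (ord >= 0) counts[ord]++;   // ord < 0: doc has no value
      }
      int[][] ordToDocs = new int[counts.length][];
      for (int ord = 0; ord < counts.length; ord++) {
        ordToDocs[ord] = new int[counts[ord]];
        counts[ord] = 0;               // reuse as a fill cursor
      }
      for (int doc = 0; doc < reader.maxDoc(); doc++) {
        int ord = dv.getOrd(doc);
        if (ord >= 0) ordToDocs[ord][counts[ord]++] = doc;
      }
      return ordToDocs;                // ordToDocs[ord] is in increasing DocId order
    }
  }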
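
And for point 2b, FixedBitSet already exposes prevSetBit alongside nextSetBit, so a bidirectional cursor over a filter's bitmap looks fairly cheap; the backwards "retreat" operation is the new piece (class and method names are made up):

  import org.apache.lucene.util.FixedBitSet;

  class BidirectionalBitSetCursor {
    private final FixedBitSet bits;
    private int doc = -1;

    BidirectionalBitSetCursor(FixedBitSet bits) { this.bits = bits; }

    int docID() { return doc; }

    // Smallest set bit >= target, or -1 if there is none (like advance()).
    int advance(int target) {
      doc = target < bits.length() ? bits.nextSetBit(target) : -1;
      return doc;
    }

    // Largest set bit <= target, or -1 if there is none: the mirror-image
    // operation that today's forward-only DocIdSetIterators lack.
    int retreat(int target) {
      doc = target >= 0 ? bits.prevSetBit(Math.min(target, bits.length() - 1)) : -1;
      return doc;
    }
  }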