Hi Lucene users,
I'm looking into ways to improve filtered and sorted searches over our 
untokenized fields: for instance, filtering by fields A and B (where each 
filter could be a term, prefix, or range) and sorting by the value of A.

Lucene scores and collects documents in DocId order. Normally DocIds are 
assigned incrementally as documents are added to the index, so they are 
effectively random with respect to the doc value ordinals of any particular 
field. Hence, when a query matches more documents in a segment than the page 
size, Lucene still has to score and collect *all* matching documents and sort 
them at the end to decide which ones constitute the top page. When the number 
of matching documents is far greater than the page size, this hurts 
performance.

When we use SortingMergePolicy, merged segments keep DocIds in the same order 
as the sort field(s) of the merge policy. If a query's sort spec is consistent 
with the segment's sort order, documents are scored and collected in 
sort-value order, so collection can be terminated early once the page size is 
reached. This yields significant performance gains. However, index-time 
sorting is limited to a single sort spec, and we need the ability to sort 
efficiently by more than one.
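
To make the early-termination mechanism concrete, here is a minimal sketch of 
the collector side, assuming the Lucene 5.x-era SimpleCollector API (the misc 
module's EarlyTerminatingSortingCollector works along these lines):

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.CollectionTerminatedException;
    import org.apache.lucene.search.SimpleCollector;

    // Sketch: stop collecting a segment after pageSize hits. This is only
    // safe when the segment's DocId order already matches the query's sort.
    final class EarlyTerminatingCollector extends SimpleCollector {
      private final int pageSize;
      private int collected;

      EarlyTerminatingCollector(int pageSize) {
        this.pageSize = pageSize;
      }

      @Override
      public void collect(int doc) {
        if (++collected > pageSize) {
          // IndexSearcher catches this and moves on to the next segment.
          throw new CollectionTerminatedException();
        }
        // ... record the hit for the top page here ...
      }

      @Override
      protected void doSetNextReader(LeafReaderContext context) {
        collected = 0; // the budget applies per segment
      }

      @Override
      public boolean needsScores() {
        return false;
      }
    }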

I want to understand what it would take to allow early termination with 
query-time sorting but without index-time sorting, and what issues would 
prevent us from doing so.
At a high level, this appears theoretically possible to me with the following 
set of changes. I want to check with the community for suggestions: am I 
overlooking problems this approach would face, or does it seem viable?

  1.  We would need to be able to get the DocIds for sorted doc values.

Lucene already exposes a FieldCache that is enumerable by sorted DocValues. It 
exposes the DocId->DocValue and DocId->ordinal mappings, but not the other way 
around.

We'd need to change FieldCache to also keep a reverse map of ordinal->DocId 
and expose it via a new API. That would make it possible to get the DocId(s) 
associated with the current DocValue/ordinal during enumeration; a rough 
sketch of building such a map is below.
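
Purely as an illustration, a hypothetical sketch of that reverse map (no such 
FieldCache API exists today; this assumes the random-access SortedDocValues of 
the Lucene 5.x era, and it accounts for the fact that several documents can 
share one ordinal):

    import java.io.IOException;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.SortedDocValues;

    // Hypothetical helper, not an existing FieldCache API: invert the
    // DocId->ordinal map in one pass. Multiple docs can share an ordinal,
    // so each ordinal maps to a group of DocIds, kept in DocId order.
    static int[][] buildOrdToDocIds(LeafReader reader, String field) throws IOException {
      SortedDocValues values = reader.getSortedDocValues(field);
      int maxDoc = reader.maxDoc();
      int[] counts = new int[values.getValueCount()];
      for (int doc = 0; doc < maxDoc; doc++) {
        int ord = values.getOrd(doc);   // forward map: DocId -> ordinal
        if (ord >= 0) counts[ord]++;    // -1 means the field is missing
      }
      int[][] ordToDocs = new int[counts.length][];
      for (int ord = 0; ord < counts.length; ord++) {
        ordToDocs[ord] = new int[counts[ord]];
        counts[ord] = 0;                // reuse as a per-ordinal write cursor
      }
      for (int doc = 0; doc < maxDoc; doc++) {
        int ord = values.getOrd(doc);
        if (ord >= 0) ordToDocs[ord][counts[ord]++] = doc;
      }
      return ordToDocs;
    }

Enumerating ordinals 0..valueCount-1 and emitting each group's DocIds would 
then visit documents in sort-value order.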

  2.  We would have to change query scorers to be aware of the desired DocId 
order (which would come from the field cache of the sort property) so they can 
score and collect in that order.

Currently all DocIdSetIterators only allow advancing forward in DocId order. 
We would have to change that:

  3.  TermScorer uses the postings reader's BlockDocsEnum as its 
DocIdSetIterator.

DocIds for each term are kept in blocks, each block holding DocIds in order. 
The postings format also maintains multi-level skip lists so that it can skip 
to a target DocId efficiently. These skip lists hold only forward pointers.

So we would have to write a codec - a postings writer - that also keeps 
multi-level skip lists with *backwards* pointers.

And we would have to write a postings reader with a BlockDocsEnum that allows 
moving backwards as well as forwards when seeking DocIds (using the backwards 
multi-level skip lists).
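
Nothing like this exists in Lucene today, but purely as a sketch, the enum's 
contract might gain mirror images of nextDoc()/advance(), backed by the 
backwards skip lists:

    import java.io.IOException;
    import org.apache.lucene.search.DocIdSetIterator;

    // Hypothetical contract, not an existing Lucene API: an iterator that
    // can seek in both directions. retreat() mirrors advance() and would
    // be served by skip lists that also carry backward pointers.
    abstract class BidirectionalDocIdSetIterator extends DocIdSetIterator {

      /** Steps to the previous doc; the mirror image of nextDoc(). */
      public abstract int prevDoc() throws IOException;

      /** Moves back to the last doc <= target; the mirror image of advance(). */
      public abstract int retreat(int target) throws IOException;
    }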

  4.  For scorers that directly use field cache filters, the underlying 
DocIdSetIterator is a FixedBitSet. It currently allows only moving forward 
using words and bit operators. We would need to change that to go in either 
direction; this would still be efficient given that FixedBitSet is an 
in-memory bitmap structure (see the sketch after this list).
  5.  Once the building blocks above are in place, we would need to change 
queries and scorers to remove any assumptions that DocIds strictly move 
forward.
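
Regarding point 4: FixedBitSet already has a prevSetBit() primitive, so a 
backward-capable iterator over a cached filter should be cheap. A hypothetical 
wrapper (not an existing class) might look like:

    import org.apache.lucene.util.FixedBitSet;

    // Hypothetical wrapper: walk a cached filter's bitset backwards using
    // the existing prevSetBit() primitive (the mirror of nextSetBit()).
    final class BackwardBitSetIterator {
      private final FixedBitSet bits;
      private int doc;

      BackwardBitSetIterator(FixedBitSet bits) {
        this.bits = bits;
        this.doc = bits.length(); // start one past the last candidate
      }

      /** Returns the previous set DocId, or -1 when exhausted. */
      int prevDoc() {
        if (doc <= 0) {
          return -1;
        }
        doc = bits.prevSetBit(doc - 1);
        return doc;
      }
    }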

Returning to point 5: FilteredQuery, for instance, uses a leapfrog strategy 
that strictly assumes DocIds only move forward.

ConjunctionScorer makes the same assumption in the way it evaluates and 
advances each individual clause.
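
To make that assumption concrete, the heart of the leapfrog looks roughly like 
the simplified sketch below (not Lucene's actual code): each iterator only 
ever receives a target beyond its current doc, which is exactly what breaks if 
we want to visit DocIds in sort-value order.

    import java.io.IOException;
    import org.apache.lucene.search.DocIdSetIterator;

    // Simplified two-way leapfrog in the spirit of FilteredQuery and
    // ConjunctionScorer. advance(target) is forward-only by contract,
    // so handing it non-monotonic targets is not allowed today.
    static int nextMatch(DocIdSetIterator lead, DocIdSetIterator other) throws IOException {
      int doc = lead.nextDoc();
      while (doc != DocIdSetIterator.NO_MORE_DOCS) {
        int otherDoc = other.advance(doc);  // may only jump forward
        if (otherDoc == doc) {
          return doc;                       // both sides match: a hit
        }
        doc = lead.advance(otherDoc);       // leapfrog past the mismatch
      }
      return DocIdSetIterator.NO_MORE_DOCS;
    }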

I am relatively new to the Lucene world and would appreciate thoughts and 
feedback on this.
Thank you,
-Ayse
