Adrien Grand created LUCENE-10396:
-------------------------------------

             Summary: Automatically create sparse indexes for sort fields
                 Key: LUCENE-10396
                 URL: https://issues.apache.org/jira/browse/LUCENE-10396
             Project: Lucene - Core
          Issue Type: Task
            Reporter: Adrien Grand
         Attachments: sorted_conjunction.png

On Elasticsearch we're more and more leveraging index sorting not as a way to 
be able to early terminate sorted queries, but as a way to cluster doc IDs that 
share similar properties so that queries can take advantage of it. For instance 
imagine you're maintaining a catalog of cars for sale, by sorting by car type, 
then fuel type then price. Then all cars with the same type, fuel type and 
similar prices will be stored in a contiguous range of doc IDs. Without index 
sorting, conjunctions across these 3 fields would be almost a worst-case 
scenario as every clause might match lots of documents while their intersection 
might not. With index sorting enabled however, there's only a very small number 
of calls to advance() that would lead to doc IDs that do not match, because 
these advance() calls that do not lead to a match would always jump over a 
large number of doc IDs. I had created that example for ApacheCon last year 
that demonstrates the benefits of index sorting on conjunctions. In both cases, 
the index is storing the same data, it just gets different doc ID ordering 
thanks to index sorting:
!sorted_conjunction.png!

While index sorting can help improve query efficiency out-of-the-box, there is 
a lot more we can do by taking advantage of the index sort explicitly. For 
instance {{IndexSortSortedNumericDocValuesRangeQuery}} can speed up range 
queries on fields that are primary sort fields by performing a binary search to 
identify the first and last documents that match the range query. I would like 
to introduce [sparse 
indexes|https://en.wikipedia.org/wiki/Database_index#Sparse_index] for fields 
that are used for index sorting, with the goal of improving the runtime of 
{{IndexSortSortedNumericDocValuesRangeQuery}} by making it less I/O intensive 
and making it easier and more efficient to leverage index sorting to filter on 
subsequent sort fields. A simple form of a sparse index could consist of 
storing every N-th values of the fields that are used for index sorting.

In terms of implementation, sparse indexing should be cheap enough that we 
wouldn't need to make it configurable and could enable it automatically as soon 
as index sorting is enabled. And it would get its own file format abstraction.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to