Ferenczi Jim created LUCENE-7423:
------------------------------------

             Summary: AutoPrefixPostingsFormat: a PostingsFormat optimized for 
prefix queries on text fields.
                 Key: LUCENE-7423
                 URL: https://issues.apache.org/jira/browse/LUCENE-7423
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/sandbox
            Reporter: Ferenczi Jim
            Priority: Minor


The autoprefix terms dict added in 
https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with 
https://issues.apache.org/jira/browse/LUCENE-7317.

The new points API is now used to do efficient range queries but the 
replacement for prefix string queries is unclear. The edge ngrams could be used 
instead but they have a lot of drawbacks and are hard to configure correctly. 
The completion postings format is also a good replacement but it requires to 
have a big FST in RAM and it cannot be intersected with other fields. 

This patch is a proposal for a new PostingsFormat optimized for prefix query on 
string fields. It detects prefixes that match "enough" terms and writes 
auto-prefix terms into their own virtual field.
 At search time the virtual field is used to speed up prefix queries that match 
"enough" terms.
The auto-prefix terms are built in two pass:
* The first pass builds a compact prefix tree. Since the terms enum is sorted 
the prefixes are flushed on the fly depending on the input. For each prefix we 
build its corresponding inverted lists using a DocIdSetBuilder. The first pass 
visits each term of the field TermsEnum only once. When a prefix is flushed 
from the prefix tree its inverted lists is dumped into a temporary file for 
further use. This is necessary since the prefixes are not sorted when they are 
removed from the tree. The selected auto prefixes are sorted at the end of the 
first pass.

* The second pass is a sorted scan of the prefixes and the temporary file is 
used to read the corresponding inverted lists.

The patch is just a POC and there are rooms for optimizations but the first 
results are promising:
I tested the patch with the geonames dataset. I indexed all the titles with the 
KeywordAnalyzer and compared the index/merge time and the size of the indices. 

The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 
572M on disk and it took 130s to index and optimize the 11M titles. 

The auto prefix index takes 287M on disk and took 70s to index and optimize the 
same 11M titles. Among the 287M, only 170M are used for the auto prefix fields 
and the rest is for the regular keyword field. All the auto prefixes were 
generated for this test (at least 2 terms per auto-prefix).  

The queries have similar performance since we are sure on both sides that one 
inverted list can answer any prefix query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to