[
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881120#comment-16881120
]
Michael McCandless commented on LUCENE-8753:
--------------------------------------------
+1 to push this new codec in Lucene, e.g. under codecs or sandbox or misc
modules, if we can avoid making changes to other sources (once LUCENE-8906 is
fixed).
> New PostingFormat - UniformSplit
> --------------------------------
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 8.0
> Reporter: Bruno Roustant
> Assignee: David Smiley
> Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4
> objectives:
> - Clear design and simple code.
> - Easily extensible, for both the logic and the index format.
> - Light memory usage with a very compact FST.
> - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the pdf attached explains visually the technique in more details)
> The principle is to split the list of terms into blocks and use a FST to
> access the block, but not as a prefix trie, rather with a seek-floor pattern.
> For the selection of the blocks, there is a target average block size (number
> of terms), with an allowed delta variation (10%) to compare the terms and
> select the one with the minimal distinguishing prefix.
> There are also several optimizations inside the block to make it more
> compact and speed up the loading/scanning.
> The performance obtained is interesting with the luceneutil benchmark,
> comparing UniformSplit with BlockTree. Find it in the first comment and also
> attached for better formatting.
> Although the precise percentages vary between runs, three main points:
> - TermQuery and PhraseQuery are improved.
> - PrefixQuery and WildcardQuery are ok.
> - Fuzzy queries are clearly less performant, because BlockTree is so
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as
> BlockTree.
> This initial version passes all Lucene tests. Use “ant test
> -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we
> have already exercised this PostingsFormat extensibility to create a
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]