[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated LUCENE-8753:
-----------------------------------
    Description: 
This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
objectives:
 - Clear design and simple code.
 - Easily extensible, for both the logic and the index format.
 - Light memory usage with a very compact FST.
 - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.

(the pdf attached explains visually the technique in more details)
 The principle is to split the list of terms into blocks and use a FST to 
access the block, but not as a prefix trie, rather with a seek-floor pattern. 
For the selection of the blocks, there is a target average block size (number 
of terms), with an allowed delta variation (10%) to compare the terms and 
select the one with the minimal distinguishing prefix.
 There are also several optimizations inside the block to make it more compact 
and speed up the loading/scanning.

The performance obtained is interesting with the luceneutil benchmark, 
comparing UniformSplit with BlockTree. Find it in the first comment and also 
attached for better formatting.

Although the precise percentages vary between runs, three main points:
 - TermQuery and PhraseQuery are improved.
 - PrefixQuery and WildcardQuery are ok.
 - Fuzzy queries are clearly less performant, because BlockTree is so optimized 
for them.

Compared to BlockTree, FST size is reduced by 15%, and segment writing time is 
reduced by 20%. So this PostingsFormat scales to lots of docs, as BlockTree.

This initial version passes all Lucene tests. Use “ant test 
-Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.

Subjectively, we think we have fulfilled our goal of code simplicity. And we 
have already exercised this PostingsFormat extensibility to create a different 
flavor for our own use-case.

Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley

  was:
This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
objectives:
- Clear design and simple code.
- Easily extensible, for both the logic and the index format.
- Light memory usage with a very compact FST.
- Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.

(the pdf attached explains visually the technique in more details)
 The principle is to split the list of terms into blocks and use a FST to 
access the block, but not as a prefix trie, rather with a seek-floor pattern. 
For the selection of the blocks, there is a target average block size (number 
of terms), with an allowed delta variation (10%) to compare the terms and 
select the one with the minimal distinguishing prefix.
There are also several optimizations inside the block to make it more compact 
and speed up the loading/scanning.

The performance obtained is interesting with the luceneutil benchmark, 
comparing UniformSplit with BlockTree. Find it in the first comment.
 
 Although the precise percentages vary between runs, three main points:
- TermQuery and PhraseQuery are improved.
- PrefixQuery and WildcardQuery are ok.
- Fuzzy queries are clearly less performant, because BlockTree is so optimized 
for them.

Compared to BlockTree, FST size is reduced by 15%, and segment writing time is 
reduced by 20%. So this PostingsFormat scales to lots of docs, as BlockTree.
 
 This initial version passes all Lucene tests. Use “ant test 
-Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.

Subjectively, we think we have fulfilled our goal of code simplicity. And we 
have already exercised this PostingsFormat extensibility to create a different 
flavor for our own use-case.
 
 Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley


> New PostingFormat - UniformSplit
> --------------------------------
>
>                 Key: LUCENE-8753
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8753
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.0
>            Reporter: Bruno Roustant
>            Priority: Major
>         Attachments: Uniform Split Technique.pdf
>
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
> objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the pdf attached explains visually the technique in more details)
>  The principle is to split the list of terms into blocks and use a FST to 
> access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number 
> of terms), with an allowed delta variation (10%) to compare the terms and 
> select the one with the minimal distinguishing prefix.
>  There are also several optimizations inside the block to make it more 
> compact and speed up the loading/scanning.
> The performance obtained is interesting with the luceneutil benchmark, 
> comparing UniformSplit with BlockTree. Find it in the first comment and also 
> attached for better formatting.
> Although the precise percentages vary between runs, three main points:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so 
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time 
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as 
> BlockTree.
> This initial version passes all Lucene tests. Use “ant test 
> -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we 
> have already exercised this PostingsFormat extensibility to create a 
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to