[jira] [Comment Edited] (LUCENE-7863) Don't repeat postings and positions on ReverseWF, EdgeNGram, etc

Mikhail Khludnev (JIRA) Wed, 14 Jun 2017 14:27:46 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033650#comment-16033650 ]


Mikhail Khludnev edited comment on LUCENE-7863 at 6/14/17 9:26 PM:
-------------------------------------------------------------------

Let's index six one-word docs:
|foo|
|foo|
|foo|
|bar|
|bar|
|bar|

h3. Inverted index with ReversedWildcardFilter

|term|posting offset (relative)|
|1oof|0|
|1rab|3|
|bar|3| 
|foo|3|

|Postings (absolute values)|
|0,1,2|
|3,4,5|
|3,4,5|
|0,1,2|

Here you see that postings (and positions) are duplicated for every derived 
term.

h2. Proposal: DRY

|term|posting offset (relative)|
|1oof|0|
|1rab|3|
|bar|-3| 
|foo|-3|

|Postings (absolute values)|
|0,1,2|
|3,4,5|
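
The idea behind the two tables can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not code from the patch: the class {{SharedPostingsSketch}} and all of its methods are hypothetical, and a real term dictionary stores file pointers into the postings file rather than list slots. The terms {{1oof}} and {{1rab}} are copied literally from the tables above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model (hypothetical, not Lucene code): every term maps to a slot in a
// list of postings, and a derived term reuses the slot of its origin term.
class SharedPostingsSketch {
    // One entry per distinct postings list that is actually stored.
    private final List<int[]> postings = new ArrayList<>();
    // Sorted by term, like a term dictionary.
    private final Map<String, Integer> termToSlot = new TreeMap<>();

    // Adds an origin term together with its own postings list.
    void addTerm(String term, int... docIds) {
        postings.add(docIds.clone());
        termToSlot.put(term, postings.size() - 1);
    }

    // Adds a derived term that points at the origin's postings instead of copying them.
    void addDerivedTerm(String derived, String origin) {
        Integer slot = termToSlot.get(origin);
        if (slot == null) {
            throw new IllegalArgumentException("unknown origin term: " + origin);
        }
        termToSlot.put(derived, slot);
    }

    // Returns the postings for a term, or an empty array if the term is unknown.
    int[] postingsFor(String term) {
        Integer slot = termToSlot.get(term);
        return slot == null ? new int[0] : postings.get(slot);
    }

    int storedPostingsLists() {
        return postings.size();
    }

    int termCount() {
        return termToSlot.size();
    }

    public static void main(String[] args) {
        SharedPostingsSketch index = new SharedPostingsSketch();
        index.addTerm("foo", 0, 1, 2);
        index.addTerm("bar", 3, 4, 5);
        index.addDerivedTerm("1oof", "foo");
        index.addDerivedTerm("1rab", "bar");
        // Four terms in the dictionary, but only two postings lists are stored.
        System.out.println(index.termCount() + " terms, "
            + index.storedPostingsLists() + " postings lists");
    }
}
```

With the six docs above, the model holds four terms but stores only two postings lists, which is the saving the proposal is after.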

h2. Note
This turns out to be really challenging to implement: since codecs don't
allow such tweaking, I had to change {{o.a.l.i}} classes. The code introduces
a relation between terms, see {{FreqProxTermsEnum.getTwinTerm()}} and so on
(it's one of the ugliest pieces). It also requires changing the term block
format: posting offsets are written as ZLong (instead of VLong), since they
need to be negative. I'm afraid it breaks a lot of tests, since I was
interested in only one, {{TestReversedWildcardFilterFactory}}. It passes. I
also experimented with 5M enwiki and it roughly works: RWF blows the index up
from 13G to 28G, while this code keeps it at 17G and runs leading-wildcard
queries fast.
It targets only {{RWF}}, where the derived term is 1-1 to the original one.
This patch is for branch_6x.
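
Why negative offsets force a signed encoding can be shown with a small sketch. The zig-zag mapping below is the standard technique for keeping small negative numbers compact in a variable-length format; as far as I know, the ZLong format relies on the same mapping, but the class {{ZigZagSketch}} and its methods are hypothetical and self-contained, not taken from the patch.

```java
// Self-contained sketch (hypothetical names): zig-zag maps signed deltas to
// unsigned values so that small negative numbers still fit in few bytes.
class ZigZagSketch {
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    static long encode(long v) {
        return (v >> 63) ^ (v << 1);
    }

    // Inverse of encode().
    static long decode(long z) {
        return (z >>> 1) ^ -(z & 1);
    }

    // Bytes needed by a 7-bits-per-byte variable-length encoding of an unsigned value.
    static int vLongBytes(long unsigned) {
        int bytes = 1;
        while ((unsigned & ~0x7FL) != 0) {
            unsigned >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        long[] deltas = {0, 3, -3};
        for (long d : deltas) {
            long z = encode(d);
            System.out.println(d + " -> " + z + " (" + vLongBytes(z)
                + " byte), decoded " + decode(z));
        }
        // Treated as a raw unsigned value, -3 has all high bits set and
        // would need the maximum number of bytes.
        System.out.println("raw -3 needs " + vLongBytes(-3L) + " bytes");
    }
}
```

The cost of the signed format is one bit per offset: positive deltas are doubled, so some of them cross a byte boundary earlier than they would as plain VLong.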

h2. Disclaimer
The current patch is quick and dirty (hard-coded {{trickedFields =
Arrays.asList("one", "body_txt_en")}} and plenty of {{sysout}}); I've only
sketched the idea.

h2. TODO
- How to carry the relation between original and derived NGram terms (1 to many)?
- How to adjust the current {{o.a.l.i}} to bring deduplicated postings to the
codec?

h2. The next idea
For \*infix\* searches we need to derive the following terms (for three
{{bar}} docs and three {{baz}} docs):
|term|posting offset|
|ar_bar|0|
|az_baz|3|
|bar|-3|
|baz|3|
|r_bar|-3|
|z_baz|3|
Here each of the two postings lists should be written only once, and an
{{\*a\*}} query finds both of them via the prefix query {{a\*}}.
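
The term derivation and the prefix lookup can be sketched as follows. This is a minimal illustration under assumptions, not part of the patch: {{InfixTermsSketch}}, {{derive}} and {{prefixScan}} are hypothetical names, a {{TreeSet}} stands in for the sorted term dictionary, and the {{_}} separator is taken from the table above (it would be ambiguous for terms that themselves contain {{_}}).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: derive "<suffix>_<term>" terms and answer an infix
// query with a prefix scan over the sorted terms.
class InfixTermsSketch {
    // Returns the term itself plus "<suffix>_<term>" for every proper suffix,
    // e.g. bar -> bar, ar_bar, r_bar.
    static List<String> derive(String term) {
        List<String> out = new ArrayList<>();
        out.add(term);
        for (int i = 1; i < term.length(); i++) {
            out.add(term.substring(i) + "_" + term);
        }
        return out;
    }

    // Prefix scan: all terms in the range [prefix, prefix + '\uffff').
    static SortedSet<String> prefixScan(NavigableSet<String> terms, String prefix) {
        return terms.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>();
        terms.addAll(derive("bar"));
        terms.addAll(derive("baz"));
        // The infix query *a* becomes the prefix query a*.
        System.out.println(prefixScan(terms, "a"));
    }
}
```

The scan for {{a}} returns {{ar_bar}} and {{az_baz}}; the part after the separator identifies the original term whose postings should be read.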


  



> Don't repeat postings and positions on ReverseWF, EdgeNGram, etc  
> ------------------------------------------------------------------
>
>                 Key: LUCENE-7863
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7863
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Mikhail Khludnev
>         Attachments: LUCENE-7863.hazard
>
>
> h2. Context
> \*suffix and \*infix\* searches on large indexes. 
> h2. Problem
> Obviously, applying {{ReversedWildcardFilter}} doubles the index size, and I
> shudder to think about EdgeNGrams...
> h2. Proposal 
> _DRY_



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
