[ https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033650#comment-16033650 ]
Mikhail Khludnev edited comment on LUCENE-7863 at 6/14/17 9:26 PM:
-------------------------------------------------------------------

Let's index six one-word docs:
|foo|
|foo|
|foo|
|bar|
|bar|
|bar|

h3. Inverted index with ReversedWildcardFilter
|term|posting offset (relative)|
|1oof|0|
|1rab|3|
|bar|3|
|foo|3|

|Postings (absolute values)|
|0,1,2|
|3,4,5|
|3,4,5|
|0,1,2|

Here you see that postings (and positions) are duplicated for every derived term.

h2. Proposal: DRY
|term|posting offset (relative)|
|1oof|0|
|1rab|3|
|bar|-3|
|foo|-3|

|Postings (absolute values)|
|0,1,2|
|3,4,5|

h2. Note
It seems really challenging to implement: given that codecs don't allow such tweaking, I had to change {{o.a.l.i}} classes. This code introduces a relation between terms, see {{FreqProxTermsEnum.getTwinTerm()}} and so on (it's one of the ugliest pieces). It also requires changing the term block format: posting offsets are written as ZLong (instead of VLong), since they need to be negative. I'm afraid it breaks a lot of tests, since I was interested only in {{TestReversedWildcardFilterFactory}}. It passes. I also experimented with 5M enwiki and it roughly works: RWF blows the index up from 13G to 28G, while this code keeps it at 17G and runs \*leading queries fast. It targets only {{RWF}}, where the derived term is 1-1 to the original one. This patch is for branch_6x.

h2. Disclaimer
The current patch is mad and dirty ({{trickedFields = Arrays.asList("one", "body_txt_en")}}, and plenty of {{sysout}}); I've just sketched out the idea.

h2. TODO
- How to carry the relation between the original and derived NGram terms (1 - Many)?
- How to adjust the current {{o.a.l.i}} to bring reduplicated postings to the codec?

h2. The next idea
For \*infix\* searches it needs to derive the following terms (for three {{bar}} docs and three {{baz}} docs):
|term|position offset|
|ar_bar|0|
|az_baz|3|
|bar|-3|
|baz|3|
|r_bar|-3|
|z_baz|3|

Here we should write both postings only once.
And on a {{\*a\*}} query, find both postings with a prefix query {{a\*}}.
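To illustrate why the Note switches posting offsets from VLong to ZLong, here is a minimal, self-contained sketch of zig-zag encoding followed by a variable-length write. It does not use Lucene classes; {{ZigZagSketch}}, {{zigZagEncode}}, {{zigZagDecode}} and {{writeVarLong}} are names made up for this example, and the byte layout is only assumed to resemble what a ZLong-style writer does.

```java
import java.io.ByteArrayOutputStream;

public class ZigZagSketch {
    // Map signed values to unsigned ones so that small magnitudes stay small:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigZagEncode(long v) {
        return (v >> 63) ^ (v << 1);
    }

    // Inverse of zigZagEncode.
    static long zigZagDecode(long z) {
        return (z >>> 1) ^ -(z & 1);
    }

    // Variable-length encoding: 7 payload bits per byte, high bit set on
    // every byte except the last. The value is treated as unsigned.
    static byte[] writeVarLong(long v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long delta = -3; // a negative relative posting offset, as in the proposal
        long encoded = zigZagEncode(delta);
        System.out.println("zig-zag(" + delta + ") = " + encoded);
        System.out.println("bytes with zig-zag:    " + writeVarLong(encoded).length);
        System.out.println("bytes without zig-zag: " + writeVarLong(delta).length);
        System.out.println("round trip:            " + zigZagDecode(encoded));
    }
}
```

Without the zig-zag step a negative delta sets the top bit of the 64-bit value, so a plain variable-length write degenerates to the maximum length; with it, a small negative delta costs a single byte.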
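The DRY proposal can also be sketched as a toy in-memory model: every distinct postings list is written once, and each term stores only a (possibly negative) delta to where its postings start. This is not the patch. {{SharedPostingsSketch}}, {{Index}} and {{build}} are hypothetical names; twins are detected here by comparing postings content, whereas the patch relates terms via {{FreqProxTermsEnum.getTwinTerm()}}; and the delta is taken against the previous term's start, a convention assumed for this sketch, so the printed numbers need not match the table in the comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class SharedPostingsSketch {
    /** Flat postings "file" plus, per term, the delta to its postings start. */
    static final class Index {
        final List<Integer> postings = new ArrayList<>();
        final Map<String, Long> startDelta = new LinkedHashMap<>();
    }

    static Index build(SortedMap<String, List<Integer>> termToDocs) {
        Index idx = new Index();
        // Remembers where each distinct postings list was already written.
        Map<List<Integer>, Long> written = new HashMap<>();
        long prevStart = 0;
        for (Map.Entry<String, List<Integer>> e : termToDocs.entrySet()) {
            Long start = written.get(e.getValue());
            if (start == null) {
                // First time we see this list: append it to the postings.
                start = (long) idx.postings.size();
                idx.postings.addAll(e.getValue());
                written.put(e.getValue(), start);
            }
            // Reusing an earlier list can point backwards, so the delta may
            // be negative; hence the need for a signed (ZLong-like) encoding.
            idx.startDelta.put(e.getKey(), start - prevStart);
            prevStart = start;
        }
        return idx;
    }

    public static void main(String[] args) {
        SortedMap<String, List<Integer>> terms = new TreeMap<>();
        terms.put("foo", Arrays.asList(0, 1, 2));
        terms.put("1oof", Arrays.asList(0, 1, 2)); // reversed twin of foo
        terms.put("bar", Arrays.asList(3, 4, 5));
        terms.put("1rab", Arrays.asList(3, 4, 5)); // reversed twin of bar
        Index idx = build(terms);
        System.out.println("postings: " + idx.postings);
        System.out.println("deltas:   " + idx.startDelta);
    }
}
```

In this model four terms share six postings entries instead of twelve, which is the effect the proposal is after for {{RWF}}; the 1 - Many case from the TODO is not covered.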
> Don't repeat postings and positions on ReverseWF, EdgeNGram, etc
> ------------------------------------------------------------------
>
>                 Key: LUCENE-7863
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7863
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Mikhail Khludnev
>         Attachments: LUCENE-7863.hazard
>
>
> h2. Context
> \*suffix and \*infix\* searches on large indexes.
> h2. Problem
> Obviously applying {{ReversedWildcardFilter}} doubles an index size, and I'm
> shuddering to think about EdgeNGrams...
> h2. Proposal
> _DRY_

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)