[ https://issues.apache.org/jira/browse/LUCENE-2588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895506#action_12895506 ]
Robert Muir commented on LUCENE-2588:
-------------------------------------

I think this patch as-is is a good improvement (at least as a defensive measure against "noise" terms and other things). It also seems to buy more savings on the non-latin data I tested (60KB -> 40KB). +1 to commit.

{quote}
In the future we could do crazier things. EG there's no real reason why the indexed terms must be regular (every N terms), so we could instead pick terms more carefully, say "approximately" every N, but favor terms that have a smaller net prefix
{quote}

I think we should explore this in the future. "Randomly" selecting every Nth term isn't optimal; allowing a "fudge" of the interval (maybe +/- 5 or 10%) would let us intentionally select terms that differ quickly from their previous term, without wasting a bunch of CPU or unbalancing the terms index (see the sketches at the end of this message). If additional smarts like this save enough size on average, maybe we could rethink lowering the default interval of 128?

> terms index should not store useless suffixes
> ---------------------------------------------
>
>                 Key: LUCENE-2588
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2588
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2588.patch
>
>
> This idea came up when discussing w/ Robert how to improve our terms index...
> The terms dict index today simply grabs whatever term is at a 0 mod 128 index (by default).
> But this is wasteful because you often don't need the full suffix of the term at that point.
> EG if the 127th term is aa and the 128th (indexed) term is abcd123456789, then instead of storing that full term you only need to store ab. The suffix is useless, and it uses up RAM since we load the terms index into RAM.
> The patch is very simple. The optimization is particularly easy because terms are now byte[] and we sort in binary order.
> I tested on the first 10M 1KB Wikipedia docs, and this reduces the terms index (tii) file from 3.9 MB -> 3.3 MB = 16% smaller (using StandardAnalyzer, indexing the body field tokenized but the title / date fields untokenized). I expect that on noisier terms dicts, especially ones w/ bad terms accidentally indexed, the savings will be even greater.
> In the future we could do crazier things. EG there's no real reason why the indexed terms must be regular (every N terms), so we could instead pick terms more carefully, say "approximately" every N, but favor terms that have a smaller net prefix. We can also index more sparsely in regions where the net docFreq is lowish, since we can afford somewhat higher seek+scan time for these terms, since enuming their docs will be much faster.
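For concreteness, here is a minimal sketch of the prefix-truncation trick the description explains: an indexed term only needs enough leading bytes to sort strictly after the term before it in binary order. This is not the actual patch code; the class and method names are made up for illustration.

{code:java}
// Hedged sketch of the optimization described above, NOT the actual patch.
// Class and method names are hypothetical.
public class PrefixTruncationSketch {

  /** Number of leading bytes of indexed needed so the stored
   *  prefix still sorts strictly after prev in binary order. */
  static int minDistinguishingPrefix(byte[] prev, byte[] indexed) {
    final int limit = Math.min(prev.length, indexed.length);
    for (int i = 0; i < limit; i++) {
      // Compare as unsigned bytes, matching the binary term sort order.
      if ((prev[i] & 0xFF) != (indexed[i] & 0xFF)) {
        return i + 1; // the first differing byte is enough
      }
    }
    // prev is a proper prefix of indexed (terms are unique and sorted),
    // so we need one byte beyond the shared prefix.
    return limit + 1;
  }

  public static void main(String[] args) {
    // The example from the description (ASCII, so getBytes() is safe here).
    byte[] prev = "aa".getBytes();
    byte[] indexed = "abcd123456789".getBytes();
    // Prints 2: only the prefix "ab" needs to be stored in the tii file.
    System.out.println(minDistinguishingPrefix(prev, indexed));
  }
}
{code}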
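And a purely hypothetical sketch of the "fudged interval" selection floated in the comment above (none of this is in the patch; pickIndexTerm and the window logic are assumptions): within a +/- fudge window around each regular interval boundary, index the candidate whose distinguishing prefix against its predecessor is shortest. It reuses minDistinguishingPrefix from the previous sketch.

{code:java}
// Hypothetical sketch of "approximately every N" term selection, NOT in the patch.
// Reuses minDistinguishingPrefix(...) from the sketch above; assumes the caller
// passes a target index that is in bounds (e.g. each multiple of the interval).
static int pickIndexTerm(byte[][] terms, int target, int fudge) {
  int lo = Math.max(1, target - fudge); // term 0 has no predecessor to diff against
  int hi = Math.min(terms.length - 1, target + fudge);
  int best = target;
  int bestLen = Integer.MAX_VALUE;
  for (int i = lo; i <= hi; i++) {
    int len = PrefixTruncationSketch.minDistinguishingPrefix(terms[i - 1], terms[i]);
    if (len < bestLen) { // favor the candidate with the shortest stored prefix
      bestLen = len;
      best = i;
    }
  }
  return best; // index terms[best], storing only its first bestLen bytes
}
{code}

The bounded window keeps the index roughly balanced while letting the shortest-prefix candidate win, at the cost of one extra prefix comparison per candidate; whether the average size savings justify it is exactly the open question above.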