[ https://issues.apache.org/jira/browse/LUCENE-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir resolved LUCENE-2654.
---------------------------------
Resolution: Duplicate
Duplicate of LUCENE-2872.
> bulk-code each chunk b/w indexed terms in the terms dict
> --------------------------------------------------------
>
> Key: LUCENE-2654
> URL: https://issues.apache.org/jira/browse/LUCENE-2654
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 4.0
> Reporter: Michael McCandless
> Priority: Minor
>
> This is an idea for exploration that came up w/ Robert...
> In PrefixCodedTermsDict (used by the default Standard codec), we encode each
> term entry "standalone", using vInts. We store the changed suffix (start,
> end, bytes), then metadata for the term like docFreq, frq start, prx start,
> skip start. Each of these ints is a vInt, which is relatively costly.
> If instead we store the N terms between indexed terms "column-stride", using
> a bulk codec like FOR/PFOR, so that the 32 docFreqs are stored as one block,
> 32 frq deltas as another, etc., then seek and next should be faster. I.e., we
> could make decode of the metadata lazy, so that a seek to a term that does
> not exist may be able to avoid any metadata decode entirely. Sequential
> scanning (lots of .next in a row) would also be faster, even if it needs the
> metadata, since bulk-decode should be faster than multiple vInt decodes.
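
To make the contrast in the description concrete, here is a minimal Java sketch. It is not Lucene code: the class TermBlockSketch and all of its methods are hypothetical names invented for illustration, and plain fixed-width bit packing stands in for a real FOR/PFOR codec (no base value is subtracted and there are no exceptions). It shows a vInt writer and reader as used for per-term metadata, one metadata column (docFreq) packed as a single block, and a lazy wrapper that decodes the block only when a docFreq is first requested.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch: per-term vInt metadata vs. one column-stride, bit-packed block. */
final class TermBlockSketch {

    /** Writes a non-negative int as a vInt: 7 data bits per byte, high bit set on all but the last byte. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads one vInt starting at pos[0] and advances pos[0] past it. */
    static int readVInt(byte[] buf, int[] pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    /** Smallest bit width (at least 1) that can hold every non-negative value in the column. */
    static int bitsRequired(int[] values) {
        int or = 0;
        for (int v : values) {
            or |= v;
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    /** Packs one column (for example all docFreqs of a chunk) with a single fixed bit width. */
    static long[] packColumn(int[] values, int bitsPerValue) {
        long[] packed = new long[(values.length * bitsPerValue + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6);
            int offset = (int) (bitPos & 63);
            packed[word] |= ((long) values[i]) << offset;
            if (offset + bitsPerValue > 64) {
                // The value straddles two words: spill its high bits into the next one.
                packed[word + 1] |= ((long) values[i]) >>> (64 - offset);
            }
        }
        return packed;
    }

    /** Bulk-decodes a whole column in one pass. */
    static int[] unpackColumn(long[] packed, int count, int bitsPerValue) {
        int[] values = new int[count];
        long mask = (1L << bitsPerValue) - 1;
        for (int i = 0; i < count; i++) {
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6);
            int offset = (int) (bitPos & 63);
            long v = packed[word] >>> offset;
            if (offset + bitsPerValue > 64) {
                v |= packed[word + 1] << (64 - offset);
            }
            values[i] = (int) (v & mask);
        }
        return values;
    }

    /** Holds the packed docFreq column of one chunk and decodes it only on first access. */
    static final class LazyDocFreqBlock {
        private final long[] packed;
        private final int count;
        private final int bits;
        private int[] decoded; // stays null until a docFreq is actually requested
        int decodeCount = 0;   // how many bulk decodes have happened (for illustration)

        LazyDocFreqBlock(int[] docFreqs) {
            this.count = docFreqs.length;
            this.bits = bitsRequired(docFreqs);
            this.packed = packColumn(docFreqs, bits);
        }

        int docFreq(int termOrdInChunk) {
            if (decoded == null) {
                decoded = unpackColumn(packed, count, bits);
                decodeCount++;
            }
            return decoded[termOrdInChunk];
        }
    }

    public static void main(String[] args) {
        int[] docFreqs = {3, 1, 17, 250, 4, 9, 1, 62};
        LazyDocFreqBlock block = new LazyDocFreqBlock(docFreqs);
        System.out.println("decodes before any lookup: " + block.decodeCount);
        System.out.println("docFreq of term 3: " + block.docFreq(3));
        System.out.println("decodes after lookups: " + block.decodeCount);
    }
}
```

In this sketch a seek that scans only the term suffixes of a chunk and finds no match never calls docFreq, so the packed column is never decoded. A sequential scan pays for one bulk decode per chunk instead of one vInt decode per metadata field per term.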