[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

Michael McCandless (JIRA) Tue, 30 Jul 2013 11:53:16 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724253#comment-13724253 ]


Michael McCandless commented on LUCENE-3069:
--------------------------------------------

Wow, those are nice perf results, without implementing intersect!

Intersect really is an optional operation, so we could stop here/now and button 
everything up :)

I like this approach: you moved all the metadata (docFreq, totalTermFreq, 
long[] and byte[] from the PostingsFormatBase) into blocks, and then when we 
really need a term's metadata we go to its block and scan for it (like block 
tree).
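
For illustration only, the block-and-scan lookup described above might be sketched like this. It is not the code from the patch: the class name, the vInt layout, and the choice to store only docFreq are assumptions made to keep the example small. The block size of 128 terms is the figure mentioned later in this comment.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch, not Lucene code: per-term metadata packed into blocks. */
public class BlockedTermMetadata {
    static final int BLOCK_SIZE = 128; // terms per block

    private final byte[] data;       // variable-length metadata for all terms
    private final int[] blockStart;  // in-memory skip data: offset of each block

    public BlockedTermMetadata(int[] docFreqs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int numBlocks = (docFreqs.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        blockStart = new int[numBlocks];
        for (int ord = 0; ord < docFreqs.length; ord++) {
            if (ord % BLOCK_SIZE == 0) {
                // Remember where this block begins so lookups can jump to it.
                blockStart[ord / BLOCK_SIZE] = out.size();
            }
            writeVInt(out, docFreqs[ord]);
        }
        data = out.toByteArray();
    }

    /** Jump to the term's block, then scan forward to its entry. */
    public int docFreq(int ord) {
        int pos = blockStart[ord / BLOCK_SIZE];
        int value = 0;
        for (int i = 0; i <= ord % BLOCK_SIZE; i++) {
            // Decode one vInt; earlier entries in the block are decoded and discarded.
            value = 0;
            int shift = 0;
            byte b;
            do {
                b = data[pos++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
        }
        return value;
    }

    /** 7 bits per byte, high bit set on every byte except the last. */
    private static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }
}
```

A lookup such as `new BlockedTermMetadata(docFreqs).docFreq(ord)` touches at most one block, so only the per-block offsets need random access. Because terms are addressed by ordinal here, the same layout would also serve a seek by ord.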

I wonder if we could use MonotonicAppendingLongBuffer instead of long[] for the 
in-memory skip data?  Right now I think it's 48 bytes per block (block = 128 
terms), so I guess that's fairly small (0.375 bytes per term).
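
As a rough check of the arithmetic, and of why a monotonic encoding could help, here is a small sketch. It does not use or reproduce MonotonicAppendingLongBuffer. The class and method names are hypothetical, and the "deviation from a straight line" scheme is only an assumed simplification of the general idea behind monotonic packing.

```java
/** Hypothetical sketch: estimates for in-memory skip data, not Lucene code. */
public class SkipDataEstimate {

    /** Skip-data overhead per term, e.g. 48 bytes / 128 terms. */
    static double bytesPerTerm(int bytesPerBlock, int termsPerBlock) {
        return (double) bytesPerBlock / termsPerBlock;
    }

    /**
     * Bits needed per value if each value of a monotonic sequence is stored
     * as its deviation from the straight line through the first and last value.
     */
    static int bitsPerValueMonotonic(long[] values) {
        int n = values.length;
        if (n < 2) {
            return 0; // a single value is fully described by the line's origin
        }
        double slope = (double) (values[n - 1] - values[0]) / (n - 1);
        long minDev = Long.MAX_VALUE;
        long maxDev = Long.MIN_VALUE;
        for (int i = 0; i < n; i++) {
            long expected = values[0] + Math.round(slope * i);
            long dev = values[i] - expected;
            minDev = Math.min(minDev, dev);
            maxDev = Math.max(maxDev, dev);
        }
        // Deviations are stored relative to minDev, so only the range matters.
        long range = maxDev - minDev;
        return 64 - Long.numberOfLeadingZeros(range);
    }
}
```

`bytesPerTerm(48, 128)` gives the 0.375 figure quoted above. For block offsets that grow at a nearly constant rate, `bitsPerValueMonotonic` returns only a few bits per entry, compared with 64 bits for each slot of a plain long[].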

{quote}
It is a little similar to BTTR now, and we can someday control how much
data to keep memory resident (e.g. keep stats in memory but metadata on 
disk, however this should be another issue).
{quote}
That's a nice (future) plus; this way the app can keep "only" the terms+ords in 
RAM, and leave all term metadata on disk.  But this is definitely optional for 
the project and we should separately explore it ...

{quote}
Another good part is, it naturally supports seek by ord. (Ah, 
actually I don't understand where it is used.)
{quote}

This is also a nice side-effect!
                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 5.0, 4.5
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

