[jira] [Commented] (LUCENE-2962) Skip data should be inlined into the postings lists

Michael McCandless (JIRA) Tue, 23 Apr 2013 04:35:23 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638969#comment-13638969
 ]


Michael McCandless commented on LUCENE-2962:
--------------------------------------------

Hi Billy,

The proposal looks good!

I think it needs some milestones with dates ... I would separate the
"dirt path": getting a basic vInt-based impl working, probably first
index-time (writer) and then the reader, from experiments like
different ways of compressing the skip data, performance experiments
across different skip settings / corpora, etc.

And perhaps add some more detail about the design of the postings
format, ie skip blocks will be interleaved into each posting stream,
etc.

Separately, it's curious we have no tasks that are hurt that much by
only single-level skipping (though we should test the crazy
minShouldMatch tasks too).  I think we need a corpus with more
documents?  Maybe try wikimediumfull (33.3M) instead of just the 10M?

                
> Skip data should be inlined into the postings lists
> ---------------------------------------------------
>
>                 Key: LUCENE-2962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>              Labels: gsoc2013
>         Attachments: proposal.txt
>
>
> Today, we store all skip data as a separate blob at the end of a given term's 
> postings (if that term occurs in enough docs to warrant skip data).
> But this adds overhead during decoding -- we have to seek to a different 
> place for the initial load, we have to init separate readers, we have to seek 
> again while using the lower levels of the skip data, etc.  Also, we have to 
> fully decode all skip information even if we are not going to use it (eg if I 
> only want docIDs, I still must decode position offset and lastPayloadLength).
> If instead we interleaved skip data into the postings file, we could keep it 
> local, and "private" to each file that needs skipping.  This should make it 
> less costly to init and then use the skip data, which'd be a good perf gain 
> for eg PhraseQuery, AndQuery.
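
As a rough illustration of the interleaving idea in the description above, here
is a minimal single-level sketch. It is not code from the issue, the proposal,
or Lucene itself: the class name InlineSkipSketch, the block layout
[lastDocDelta][blockByteLength][docDeltas...], and the blockSize parameter are
all assumptions made for the example. Each block of vInt-encoded doc deltas is
preceded by a small inline skip entry, so a reader can jump over a whole block
without decoding it and without seeking to a separate skip blob.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: docID-only postings with skip entries inlined per block.
final class InlineSkipSketch {

    // Write v as a vInt: 7 data bits per byte, high bit set when more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Read a vInt from buf at pos[0], advancing pos[0] past it.
    static int readVInt(byte[] buf, int[] pos) {
        int b = buf[pos[0]++];
        int v = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            v |= (b & 0x7F) << shift;
        }
        return v;
    }

    // Encode strictly increasing, non-negative docIDs in blocks of blockSize docs.
    // Assumed layout per block: [lastDocDelta][blockByteLength][docDeltas...]
    static byte[] encode(int[] docs, int blockSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prevBlockLast = 0;
        for (int start = 0; start < docs.length; start += blockSize) {
            int end = Math.min(start + blockSize, docs.length);
            ByteArrayOutputStream block = new ByteArrayOutputStream();
            int prev = prevBlockLast;
            for (int i = start; i < end; i++) {
                writeVInt(block, docs[i] - prev);
                prev = docs[i];
            }
            writeVInt(out, prev - prevBlockLast); // inline skip entry: last doc in block
            writeVInt(out, block.size());         // inline skip entry: bytes to jump over
            out.write(block.toByteArray(), 0, block.size());
            prevBlockLast = prev;
        }
        return out.toByteArray();
    }

    // Return the first docID >= target, or -1 if there is none.
    // Blocks whose last doc is below target are skipped without decoding.
    static int advance(byte[] buf, int target) {
        int[] pos = {0};
        int blockBase = 0;
        while (pos[0] < buf.length) {
            int blockLast = blockBase + readVInt(buf, pos);
            int blockBytes = readVInt(buf, pos);
            if (blockLast < target) {
                pos[0] += blockBytes; // jump over the whole block
                blockBase = blockLast;
                continue;
            }
            int doc = blockBase;
            int end = pos[0] + blockBytes;
            while (pos[0] < end) {
                doc += readVInt(buf, pos);
                if (doc >= target) {
                    return doc;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] docs = {3, 7, 20, 21, 40, 100, 250, 251, 1000};
        byte[] postings = encode(docs, 4);
        System.out.println(advance(postings, 22));
    }
}
```

A real postings format would also carry freqs, positions and payloads, and
would likely need multiple skip levels for long postings lists; this sketch
only shows how keeping the skip entry next to the block it describes avoids
the separate seek and the separate reader.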
