[ 
https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396325#comment-13396325
 ] 

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

On the For patch: we shouldn't need to encode/decode numInts, right?
It's always 128?

Up above, in ForFactory, when we readInt() to get numBytes, it seems
like we could stuff the header numBits into that same int and save
checking it in FORUtil.decompress.
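
A minimal sketch of that header packing, just to make the idea
concrete.  BlockHeader and its 6/26 bit split are made up for
illustration and are not what the patch does; it only assumes numBits
is in 0..32 and numBytes fits in the remaining 26 bits:

```java
// Hypothetical helper, not part of the LUCENE-3892 patches.
public final class BlockHeader {
    // Assumed layout: low 6 bits hold numBits (0..32),
    // upper 26 bits hold numBytes.
    private static final int NUM_BITS_WIDTH = 6;
    private static final int NUM_BITS_MASK = (1 << NUM_BITS_WIDTH) - 1;
    private static final int MAX_NUM_BYTES = (1 << (32 - NUM_BITS_WIDTH)) - 1;

    private BlockHeader() {}

    /** Packs numBytes and numBits into the single int written per block. */
    public static int encode(int numBytes, int numBits) {
        if (numBits < 0 || numBits > 32) {
            throw new IllegalArgumentException("numBits out of range: " + numBits);
        }
        if (numBytes < 0 || numBytes > MAX_NUM_BYTES) {
            throw new IllegalArgumentException("numBytes out of range: " + numBytes);
        }
        return (numBytes << NUM_BITS_WIDTH) | numBits;
    }

    /** Unsigned shift so a header with the top bit set still decodes. */
    public static int numBytes(int header) {
        return header >>> NUM_BITS_WIDTH;
    }

    public static int numBits(int header) {
        return header & NUM_BITS_MASK;
    }
}
```

The reader would then do one readInt() and pull both fields out of it,
so the decoder never has to look inside the block for numBits.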

I think there are a few possible ideas to explore to get faster
PFor/For performance:

  * Get more direct access to the file as an int[]; eg MMapDir could
    expose an IntBuffer from its ByteBuffer (saving the initial copy
    into byte[] that we now do).  Or maybe we add
    IndexInput.readInts(int[]) and let each dir impl optimize how
    that's done.  MMapDir could use Unsafe.copyBytes, except that on
    little endian architectures we'd probably need a separate
    specialized decoder rather than letting Int/ByteBuffer do the byte
    swapping.  This would require that the whole file stays aligned
    w/ int (eg the header length must be 0 mod 4).

  * Copy/share how oal.packed works, i.e. being able to waste a bit to
    have faster decode (eg storing the 7 bit case as byte[], wasting 1
    bit for each value).

  * Skipping: can we partially decode a block?  EG if we are skipping
    and we know we only want values after the 80th one, then we
    shouldn't decode those first 80...

  * Since doc/freq are "aligned", when we store pointers to a given
    spot, eg in the terms dict or in skip data, we should only store
    the offset once (today we store it twice).

  * Alternatively, maybe we should only save skip data on doc/freq
    block boundaries (prox would still need skip-within-block).

  * Maybe we should store doc & frq blocks interleaved in a single
    file (since they are "aligned") and then skip would skip to the
    start of a doc/frq block pair.
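
To illustrate the partial-decode idea from the skipping bullet, here
is a rough sketch.  It assumes a plain fixed-width bit-packed layout
(values stored back to back, low bits first, in an int[]), which is
not necessarily how the For/PFor patches lay out a block, and the
PartialDecode class is invented for this example.  With that layout
the bit position of value i is just i * numBits, so the decoder can
start at the 80th value without touching the first 80:

```java
// Hypothetical sketch, not the patch's decoder.
public final class PartialDecode {
    private PartialDecode() {}

    /** Packs each value into numBits bits (1..32), low bits first. */
    public static int[] pack(int[] values, int numBits) {
        long mask = (1L << numBits) - 1;
        int[] packed = new int[(int) (((long) values.length * numBits + 31) >>> 5)];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * numBits;
            int word = (int) (bitPos >>> 5);
            int shift = (int) (bitPos & 31);
            long v = values[i] & mask;
            packed[word] |= (int) (v << shift);
            if (shift + numBits > 32) {
                // Value straddles two ints: spill the high bits.
                packed[word + 1] |= (int) (v >>> (32 - shift));
            }
        }
        return packed;
    }

    /** Decodes count values starting at index from, skipping earlier ones. */
    public static int[] decodeFrom(int[] packed, int numBits, int from, int count) {
        long mask = (1L << numBits) - 1;
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            long bitPos = (long) (from + i) * numBits;
            int word = (int) (bitPos >>> 5);
            int shift = (int) (bitPos & 31);
            long bits = (packed[word] & 0xFFFFFFFFL) >>> shift;
            if (shift + numBits > 32) {
                bits |= (packed[word + 1] & 0xFFFFFFFFL) << (32 - shift);
            }
            out[i] = (int) (bits & mask);
        }
        return out;
    }
}
```

A real decoder would likely be unrolled per numBits, so whether
starting mid-block is actually a win would need benchmarking; PFor
exceptions would also complicate this.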

Other ideas...?

                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, 
> Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, 
> LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, 
> LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.

