[ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396325#comment-13396325 ]
Michael McCandless commented on LUCENE-3892:
--------------------------------------------

On the For patch ... we shouldn't encode/decode numInts right? It's always 128?

Up above, in ForFactory, when we readInt() to get numBytes ... it seems like we could stuff the header numBits into that same int and save checking that in FORUtil.decompress....

I think there are a few possible ideas to explore to get faster PFor/For performance:

  * Get more direct access to the file as an int[]; eg MMapDir could expose an IntBuffer from its ByteBuffer (saving the initial copy into byte[] that we now do). Or maybe we add IndexInput.readInts(int[]) and dir impl can optimize how that's done (MMapDir could use Unsafe.copyBytes... except for little endian architectures ... we'd probably have to have separate specialized decoder rather than letting Int/ByteBuffer do the byte swapping). This would require the whole file stays aligned w/ int (eg the header must be 0 mod 4).

  * Copy/share how oal.packed works, i.e. being able to waste a bit to have faster decode (eg storing the 7 bit case as byte[], wasting 1 bit for each value).

  * Skipping: can we partially decode a block? EG if we are skipping and we know we only want values after the 80th one, then we shouldn't decode those first 80...

  * Since doc/freq are "aligned", when we store pointers to a given spot, eg in the terms dict or in skip data, we should only store the offset once (today we store it twice).

  * Alternatively, maybe we should only save skip data on doc/freq block boundaries (prox would still need skip-within-block).

  * Maybe we should store doc & frq blocks interleaved in a single file (since they are "aligned") and then skip would skip to the start of a doc/frq block pair.

Other ideas...?

> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_for.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.
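
A minimal sketch of the header idea raised in the comment (storing numBits in the same int that is already read to get numBytes). The class name, the method names, and the 6-bit/26-bit split are assumptions made for illustration; they are not taken from the LUCENE-3892 patches:

```java
/**
 * Hypothetical helper that packs a block's byte length and its bit width
 * into a single int, so one readInt() yields both values.
 *
 * Assumed layout (not from the patch):
 *   low 6 bits    = numBits  (0..32)
 *   upper 26 bits = numBytes
 */
final class PackedBlockHeader {
    // 6 bits cover 0..63, which is enough to hold a bit width of 0..32.
    private static final int BITS_FOR_NUM_BITS = 6;
    private static final int NUM_BITS_MASK = (1 << BITS_FOR_NUM_BITS) - 1;
    private static final int MAX_NUM_BYTES = (1 << (32 - BITS_FOR_NUM_BITS)) - 1;

    private PackedBlockHeader() {}

    /** Combines numBytes and numBits into one int header. */
    static int encode(int numBytes, int numBits) {
        if (numBits < 0 || numBits > 32) {
            throw new IllegalArgumentException("numBits out of range: " + numBits);
        }
        if (numBytes < 0 || numBytes > MAX_NUM_BYTES) {
            throw new IllegalArgumentException("numBytes out of range: " + numBytes);
        }
        return (numBytes << BITS_FOR_NUM_BITS) | numBits;
    }

    /** Extracts the bit width from the low bits of the header. */
    static int numBits(int header) {
        return header & NUM_BITS_MASK;
    }

    /** Extracts the byte length; the unsigned shift keeps large lengths non-negative. */
    static int numBytes(int header) {
        return header >>> BITS_FOR_NUM_BITS;
    }
}
```

For example, a block of 128 values at 7 bits each occupies 128 * 7 / 8 = 112 bytes, so the writer would store `PackedBlockHeader.encode(112, 7)` and the reader would recover both fields from that one int without inspecting the block payload.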