[ https://issues.apache.org/jira/browse/LUCENE-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-4161:
---------------------------------

    Attachment: LUCENE-4161.patch

First version of the patch.

A few things that were internal now need to be exposed, so I tried to do some cleanup:
 * {{CODEC_NAME}} and CODEC_VERSION{START,CURRENT} are public,
 * the format is an enum (PackedInts.Format.{PACKED,PACKED_SINGLE_BLOCK}),
 * improved docs overall.

There are new factory methods get{Reader,ReaderIterator,Writer}NoHeader that do the same as their get{Reader,ReaderIterator,Writer} counterparts, but without writing or checking a header.

Improved performance of the Reader/Mutable bulk methods (using code generation, see http://people.apache.org/~jpountz/packed_ints.html vs. http://people.apache.org/~jpountz/packed_ints2.html). {{ReaderIterator}} and {{Writer}} now use the same code as the {{Reader}}/{{Mutable}} bulk methods, so they are likely to be much faster too. In addition, ReaderIterator now allows consumers to retrieve several values at once.

{{Direct*}} and {{Packed*ThreeBlocks}} had a lot of duplicate code that could not be factored out, so I created scripts to generate them.

One thing that might still slow down ReaderIterator (probably the most useful class for codecs) a bit is that it always reads one long at a time. Adding a method to DataInput to bulk-read longs (similar to readBytes) might improve performance. This probably deserves another JIRA issue and can be done later.

> Make PackedInts usable by codecs
> --------------------------------
>
>                 Key: LUCENE-4161
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4161
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/store
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-4161.patch
>
>
> Some codecs might be interested in using
> PackedInts.{Writer,Reader,ReaderIterator} to read and write fixed-size values
> efficiently.
> The problem is that the serialization format is self-contained, and always
> writes the name of the codec, its version, its number of bits per value and
> its format. For example, if you want to use packed ints to store your
> postings list, this is a lot of overhead (at least ~60 bytes per term, in
> case you only use one Writer per term, more otherwise).
> Users should be able to externalize the storage of metadata to save space.
> For example, to use PackedInts to store a postings list, one should be able
> to store the codec name, its version and the number of bits per doc in the
> header of the terms+postings list instead of having to write it once (or
> more!) per term.
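
To make the idea of externalized metadata concrete, here is a minimal, self-contained sketch of header-less bit packing. It is not Lucene's PackedInts code and does not use the get*NoHeader methods from the patch: the class {{HeaderlessPackedSketch}} and its methods {{pack}} and {{get}} are hypothetical names invented for this illustration. The point it tries to show is that the packed blocks contain only payload, so the caller has to remember the bits per value and the value count somewhere else (for instance once in a file-level header).

```java
/**
 * Illustration only: NOT Lucene's PackedInts implementation.
 * All names here are hypothetical and exist just for this sketch.
 */
final class HeaderlessPackedSketch {

    private static long maskFor(int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 64) {
            throw new IllegalArgumentException("bitsPerValue must be in [1, 64]");
        }
        return bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
    }

    /** Packs values back to back, bitsPerValue bits each. No header is written. */
    static long[] pack(long[] values, int bitsPerValue) {
        long mask = maskFor(bitsPerValue);
        long totalBits = (long) values.length * bitsPerValue;
        long[] blocks = new long[(int) ((totalBits + 63) >>> 6)];
        for (int i = 0; i < values.length; i++) {
            long v = values[i];
            if ((v & ~mask) != 0) {
                throw new IllegalArgumentException("value does not fit: " + v);
            }
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);   // index of the 64-bit block
            int shift = (int) (bitPos & 63);    // bit offset inside that block
            blocks[block] |= v << shift;
            int spill = shift + bitsPerValue - 64;
            if (spill > 0) {
                // The value straddles two blocks: store its high bits in the next one.
                blocks[block + 1] |= v >>> (bitsPerValue - spill);
            }
        }
        return blocks;
    }

    /** Reads one value. bitsPerValue must come from metadata stored elsewhere. */
    static long get(long[] blocks, int bitsPerValue, int index) {
        long mask = maskFor(bitsPerValue);
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            v |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return v & mask;
    }

    public static void main(String[] args) {
        long[] values = {3, 0, 7, 5, 1};
        int bitsPerValue = 3; // would be stored once, outside the packed data
        long[] blocks = pack(values, bitsPerValue);
        System.out.println("payload bytes: " + blocks.length * Long.BYTES);
        for (int i = 0; i < values.length; i++) {
            System.out.println(i + " -> " + get(blocks, bitsPerValue, i));
        }
    }
}
```

In this sketch the five 3-bit values fit in a single long, so a fixed per-term header of the size mentioned in the issue would be several times larger than the data it describes.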