[jira] [Updated] (LUCENE-4161) Make PackedInts usable by codecs

2012-07-03 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4161:
-

Attachment: LUCENE-4161.patch

New patch. I renamed 'n' to 'iterations', fixed the style issues and improved 
documentation. All core tests pass, including the backward-compatibility tests 
I added in r1356228.

I think it is a good idea to handle int[] encoding/decoding in a separate 
issue, given how big this patch already is.

 Make PackedInts usable by codecs
 

 Key: LUCENE-4161
 URL: https://issues.apache.org/jira/browse/LUCENE-4161
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/store
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4161.patch, LUCENE-4161.patch


 Some codecs might be interested in using 
 PackedInts.{Writer,Reader,ReaderIterator} to read and write fixed-size values 
 efficiently.
 The problem is that the serialization format is self-contained and always 
 writes the name of the codec, its version, its number of bits per value and 
 its format. For example, if you want to use packed ints to store your 
 postings list, this is a lot of overhead (at least ~60 bytes per term if you 
 use only one Writer per term, more otherwise).
 Users should be able to externalize the storage of metadata to save space. 
 For example, to use PackedInts to store a postings list, one should be able 
 to store the codec name, its version and the number of bits per doc in the 
 header of the terms+postings list instead of having to write it once (or 
 more!) per term.
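
 To make the overhead concrete, here is a minimal sketch in plain Java. It is 
 not Lucene code: the class name, the helper methods and the header layout 
 (a UTF string, a version int and a bits-per-value int) are all assumptions 
 made for illustration, so the header is smaller than the ~60 bytes mentioned 
 above. It compares writing that header once per term with writing it once 
 for the whole postings file.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: these names and this header layout are NOT Lucene's.
final class PackedHeaderSketch {

  // Illustrative header: codec name, version, bits per value.
  static void writeHeader(DataOutputStream out, int bitsPerValue) throws IOException {
    out.writeUTF("PackedInts");
    out.writeInt(0); // made-up version number
    out.writeInt(bitsPerValue);
  }

  // Writes each value with bitsPerValue bits, padded to whole 64-bit blocks.
  // Bit-by-bit on purpose: simple to follow, not fast.
  static void writePacked(DataOutputStream out, long[] values, int bitsPerValue)
      throws IOException {
    long block = 0;
    int bitsUsed = 0;
    for (long v : values) {
      for (int b = bitsPerValue - 1; b >= 0; b--) {
        block = (block << 1) | ((v >>> b) & 1L);
        if (++bitsUsed == 64) {
          out.writeLong(block);
          block = 0;
          bitsUsed = 0;
        }
      }
    }
    if (bitsUsed > 0) {
      out.writeLong(block << (64 - bitsUsed)); // left-align the last partial block
    }
  }

  // Self-contained format: every term repeats the header.
  static int selfContainedSize(long[][] perTerm, int bitsPerValue) {
    return size(perTerm, bitsPerValue, true);
  }

  // Externalized metadata: the header is written once, up front.
  static int sharedHeaderSize(long[][] perTerm, int bitsPerValue) {
    return size(perTerm, bitsPerValue, false);
  }

  private static int size(long[][] perTerm, int bitsPerValue, boolean headerPerTerm) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      if (!headerPerTerm) {
        writeHeader(out, bitsPerValue);
      }
      for (long[] values : perTerm) {
        if (headerPerTerm) {
          writeHeader(out, bitsPerValue);
        }
        writePacked(out, values, bitsPerValue);
      }
      out.flush();
      return bytes.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    long[][] perTerm = new long[1000][8]; // 1000 terms, 8 small values each
    System.out.println("header per term: " + selfContainedSize(perTerm, 4) + " bytes");
    System.out.println("shared header:   " + sharedHeaderSize(perTerm, 4) + " bytes");
  }
}
```

 With this invented 20-byte header, 1000 terms of eight 4-bit values take 
 28000 bytes when every term carries its own header and 8020 bytes when the 
 header is shared. The sketch assumes the number of values per term is known 
 from elsewhere, for example from the terms dictionary.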

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4161) Make PackedInts usable by codecs

2012-06-25 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-4161:
-

Attachment: LUCENE-4161.patch

First version of the patch.

A few things that were internal now need to be exposed, so I did some 
cleanup:
 * {{CODEC_NAME}} and CODEC_VERSION{START,CURRENT} are public,
 * the format is an enum (PackedInts.Format.{PACKED,PACKED_SINGLE_BLOCK}),
 * improved docs overall.

There are new factory methods get{Reader,ReaderIterator,Writer}NoHeader that 
behave like their get{Reader,ReaderIterator,Writer} counterparts but neither 
write nor check a header.

Improved performance of Reader/Mutable bulk methods (using code generation, see 
http://people.apache.org/~jpountz/packed_ints.html vs. 
http://people.apache.org/~jpountz/packed_ints2.html).

{{ReaderIterator}} and {{Writer}} now use the same code as 
{{Reader}}/{{Mutable}} bulk methods so they are likely to be much faster too. 
In addition, ReaderIterator now allows consumers to retrieve several values at 
the same time.
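
As an illustration of what retrieving several values at once means, here is a 
minimal sketch of a bulk decode over 64-bit blocks. It is a generic loop in 
plain Java, not the generated code from the patch, and the class and method 
names are hypothetical. It assumes values are stored most-significant-bit 
first, may span two blocks, and each fit in bitsPerValue bits.

```java
// Hypothetical sketch: not the PackedInts implementation.
final class BulkDecodeSketch {

  // Packs values back to back into 64-bit blocks.
  static long[] pack(long[] values, int bitsPerValue) {
    long[] blocks = new long[(int) (((long) values.length * bitsPerValue + 63) / 64)];
    long bitPos = 0;
    for (long v : values) {
      int block = (int) (bitPos >>> 6);
      int room = 64 - (int) (bitPos & 63); // free bits left in this block
      if (bitsPerValue <= room) {
        blocks[block] |= v << (room - bitsPerValue);
      } else {
        int spill = bitsPerValue - room; // bits that go into the next block
        blocks[block] |= v >>> spill;
        blocks[block + 1] |= v << (64 - spill);
      }
      bitPos += bitsPerValue;
    }
    return blocks;
  }

  // Decodes count values starting at index from in a single call.
  static long[] decode(long[] blocks, int bitsPerValue, int from, int count) {
    long mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
    long[] dest = new long[count];
    long bitPos = (long) from * bitsPerValue;
    for (int i = 0; i < count; i++) {
      int block = (int) (bitPos >>> 6);
      int room = 64 - (int) (bitPos & 63);
      long v;
      if (bitsPerValue <= room) {
        v = blocks[block] >>> (room - bitsPerValue);
      } else {
        int spill = bitsPerValue - room;
        v = (blocks[block] << spill) | (blocks[block + 1] >>> (64 - spill));
      }
      dest[i] = v & mask;
      bitPos += bitsPerValue;
    }
    return dest;
  }

  public static void main(String[] args) {
    long[] values = new long[20];
    for (int i = 0; i < values.length; i++) {
      values[i] = i; // every value fits in 5 bits
    }
    long[] blocks = pack(values, 5);
    System.out.println(java.util.Arrays.toString(decode(blocks, 5, 10, 5)));
  }
}
```

A generated, per-bitsPerValue version would unroll this loop with constant 
shifts and masks; the generic loop above only shows the access pattern.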

{{Direct*}} and {{Packed*ThreeBlocks}} had a lot of duplicate code that could 
not be factored out, so I created scripts to generate them.

Something that might still slow down ReaderIterator (probably the most useful 
class for codecs) a bit is that it always reads one long at a time. 
Adding a method to DataInput that bulk-reads longs (similar to readBytes) might 
improve performance. This probably deserves another JIRA issue and can be 
done later.
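
To show the difference between the two access patterns, here is a minimal 
sketch using java.io.DataInputStream as a stand-in for Lucene's DataInput 
(a different class). The helper names are hypothetical, and the sketch makes 
no claim about actual speedups; it only shows that one bulk byte read followed 
by an in-memory conversion yields the same longs as repeated readLong calls.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical sketch built on java.io, not on Lucene's DataInput.
final class BulkReadLongsSketch {

  // One call into the stream per long.
  static void readLongsOneByOne(DataInputStream in, long[] dest, int off, int len)
      throws IOException {
    for (int i = 0; i < len; i++) {
      dest[off + i] = in.readLong();
    }
  }

  // One bulk byte read, then a big-endian conversion in memory.
  static void readLongsBulk(DataInputStream in, long[] dest, int off, int len)
      throws IOException {
    byte[] bytes = new byte[len * Long.BYTES];
    in.readFully(bytes);
    // ByteBuffer is big-endian by default, matching DataInputStream.readLong.
    ByteBuffer.wrap(bytes).asLongBuffer().get(dest, off, len);
  }

  // Writes values to a byte array and reads them back with either strategy.
  static long[] roundTrip(long[] values, boolean bulk) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      for (long v : values) {
        out.writeLong(v);
      }
      out.flush();
      DataInputStream in =
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
      long[] dest = new long[values.length];
      if (bulk) {
        readLongsBulk(in, dest, 0, dest.length);
      } else {
        readLongsOneByOne(in, dest, 0, dest.length);
      }
      return dest;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    long[] values = {1L, -2L, Long.MAX_VALUE, 0L, 42L};
    System.out.println(java.util.Arrays.equals(roundTrip(values, true),
                                               roundTrip(values, false)));
  }
}
```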

 Make PackedInts usable by codecs
 

 Key: LUCENE-4161
 URL: https://issues.apache.org/jira/browse/LUCENE-4161
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/store
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
 Attachments: LUCENE-4161.patch


