[ https://issues.apache.org/jira/browse/LUCENE-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403910#comment-13403910 ]
Adrien Grand commented on LUCENE-4161:
--------------------------------------

bq. Can we find a better name for computeN?

The meaning of {{n}} is actually a bit complicated. For every number of bits per value, there is a minimum number of 64-bit blocks (b) / values (v) you need to write in order to reach the next block boundary:
 * 16 bits per value -> b=1, v=4
 * 24 bits per value -> b=3, v=8
 * 50 bits per value -> b=25, v=32
 * 63 bits per value -> b=63, v=64
 * ...

A bulk read consists of copying {{n*v}} values that are contained in {{n*b}} blocks into a long[] (higher values of {{n}} are likely to yield better throughput). This requires {{n * (b + v)}} longs in memory ({{n*b}} for the blocks plus {{n*v}} for the decoded values), which is why I compute {{n}} as {{ramBudget / (8 * (b + v))}} (since a long is 8 bytes).

I called it {{n}} in the method name because I have no idea how to name it... "iterations", maybe?

bq. I suspect, to use these for codecs, we will want to have versions that work on int[] values instead (everything we encode are ints: docIDs/deltas, term freqs, offsets, positions).

I hesitated to do this since it would involve some code duplication, but I guess it can't be avoided if we want this API to be actually used... What additional methods do you think we need?
 * {{PackedReaderIterator.nextInts(int count)}}
 * others?

bq. [static computeN], [code style]

You are right, I will fix it!

bq. Does this change the on-disk format?

No, it doesn't. I will add unit tests for that...

> Make PackedInts usable by codecs
> --------------------------------
>
>                 Key: LUCENE-4161
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4161
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/store
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-4161.patch
>
>
> Some codecs might be interested in using
> PackedInts.{Writer,Reader,ReaderIterator} to read and write fixed-size values
> efficiently.
> The problem is that the serialization format is self contained, and always
> writes the name of the codec, its version, its number of bits per value and
> its format. For example, if you want to use packed ints to store your
> postings list, this is a lot of overhead (at least ~60 bytes per term, in
> case you only use one Writer per term, more otherwise).
> Users should be able to externalize the storage of metadata to save space.
> For example, to use PackedInts to store a postings list, one should be able
> to store the codec name, its version and the number of bits per doc in the
> header of the terms+postings list instead of having to write it once (or
> more!) per term.
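
For illustration, the b/v pairs listed in the comment and the {{n}} formula can be reproduced with a small standalone sketch. This is not code from the patch: the class and method names ({{PackedBlockMath}}, {{blocks}}, {{values}}, {{iterations}}) are hypothetical, and it assumes 64-bit blocks, as implied by the long[] buffer. The idea is that the next block boundary is reached after lcm(bitsPerValue, 64) bits.

```java
class PackedBlockMath {

    /** Greatest common divisor (Euclid's algorithm). */
    static int gcd(int a, int b) {
        while (b != 0) {
            int t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    /** b: minimum number of 64-bit blocks needed to reach a block boundary. */
    static int blocks(int bitsPerValue) {
        // lcm(bitsPerValue, 64) / 64
        return bitsPerValue / gcd(bitsPerValue, 64);
    }

    /** v: number of values stored in those b blocks. */
    static int values(int bitsPerValue) {
        // lcm(bitsPerValue, 64) / bitsPerValue
        return 64 / gcd(bitsPerValue, 64);
    }

    /**
     * n = ramBudget / (8 * (b + v)), following the formula in the comment.
     * Note: this is 0 when the budget is smaller than one iteration;
     * handling that case is outside the scope of this sketch.
     */
    static int iterations(int bitsPerValue, int ramBudgetBytes) {
        int b = blocks(bitsPerValue);
        int v = values(bitsPerValue);
        return ramBudgetBytes / (8 * (b + v));
    }

    public static void main(String[] args) {
        int ramBudgetBytes = 1024; // arbitrary budget chosen for the demo
        for (int bpv : new int[] {16, 24, 50, 63}) {
            System.out.println("bitsPerValue=" + bpv
                + " b=" + blocks(bpv)
                + " v=" + values(bpv)
                + " n=" + iterations(bpv, ramBudgetBytes));
        }
    }
}
```

With a 1024-byte budget and 24 bits per value, one iteration needs 8 * (3 + 8) = 88 bytes, so the formula gives n = 11.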