[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803635#action_12803635 ]
Michael McCandless commented on LUCENE-1990:
--------------------------------------------

bq. I have working code for packed32 and packed64 and am currently fitting it into Michael's patch. I hope to finish it this weekend.

Nice! Sounds like good progress, Toke!

bq. The current draft from Michael McCandless states both bitsPerValue and maxValue in the persistent format

I was only storing maxValue as a convenience for the layer above; we don't need to do that. I think storing the format (packed, aligned32, aligned64) and bitsPerValue makes sense.

bq. Regardless of whether 32bit or 64bit is used when a packed structure is created, it can be read as both 32bit and 64bit packed.

Right, but with the challenge (if we use a 32bit backing array) of properly handling the nbits>32 case (this is perfectly doable... "it's just software" ;) ).

bq. As for the special cases of 8, 16, 32 and 64 bits/value, the bit patterns are identical for both packed and aligned.

I had chosen to match IndexOutput/IndexInput's byte order (big-endian) so that the packed format naturally reads back with IndexInput's readLong/Int/Short (I added a readShort). I'm assuming that for these special cases dedicated Reader impls, with byte[], short[], int[], long[] backing arrays, are faster than eg backing with a long[] and shifting/masking per lookup.

But eg for the nbits=3 case, aligned 32/64 would ensure that no value spans two underlying entries in the backing array (wasting some bits of storage in exchange). Whereas the nbits=2 or 4 cases would naturally be aligned anyway...

One question: the Reader api is now this:

{code}
long get(int index);
{code}

Which is convenient since obviously long can accommodate all of the possible underlying nbits, but... for small nbits values, this logically entails a cast. EG say nbits=8, so it's a direct byte[] backing array. get() must cast up to long, and the caller must operate with long...
I'm wondering whether that forced casting is going to hurt performance enough to make us want to have dedicated precision (8, 16, 32, 64) Reader interfaces....

> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
>                 Key: LUCENE-1990
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1990
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1990.patch, LUCENE-1990_PerformanceMeasurements20100104.zip
>
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl. EG the terms dict index in
> the standard codec in LUCENE-1458 could substantially reduce its RAM
> usage. FieldCache.StringIndex could as well. And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs {
>   long get(long index);
>   void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting. If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
> PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is a good existing impl
> that has a compatible license that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!
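
To make the packed layout discussed above a bit more concrete, here is a minimal Java sketch of a packed store over a long[] backing array, where a value may span two adjacent entries. This is only an illustration under stated assumptions, not the code from the LUCENE-1990 patch: the class name PackedSketch, its constructor, and the bitsRequired helper are hypothetical, and the sketch places values starting at the low bits of each long, which is not necessarily the big-endian on-disk order mentioned in the comment.

```java
/**
 * Hypothetical sketch of a "packed" unsigned store (not the LUCENE-1990 patch).
 * Values of bitsPerValue bits are laid out back to back in a long[];
 * a value may straddle two adjacent longs.
 */
final class PackedSketch {
    private final long[] blocks;     // 64-bit backing array
    private final int bitsPerValue;  // 1..64
    private final long mask;         // low bitsPerValue bits set

    PackedSketch(int count, int bitsPerValue) {
        if (count < 0 || bitsPerValue < 1 || bitsPerValue > 64) {
            throw new IllegalArgumentException("invalid count or bitsPerValue");
        }
        this.bitsPerValue = bitsPerValue;
        this.mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
        long totalBits = (long) count * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) >>> 6)];
    }

    /** Bits needed to hold maxValue, treated as unsigned; at least 1. */
    static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    /** Always returns long, whatever bitsPerValue is (the cast discussed above). */
    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);   // which long holds the first bit
        int offset = (int) (bitPos & 63);   // bit offset inside that long
        long value = blocks[block] >>> offset;
        int bitsInFirst = 64 - offset;
        if (bitsInFirst < bitsPerValue) {
            // The value spans two longs: pull the high part from the next one.
            value |= blocks[block + 1] << bitsInFirst;
        }
        return value & mask;
    }

    void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        value &= mask;
        // Clear the target bits in the first long, then write the low part.
        blocks[block] = (blocks[block] & ~(mask << offset)) | (value << offset);
        int bitsInFirst = 64 - offset;
        if (bitsInFirst < bitsPerValue) {
            // Write the remaining high part into the low bits of the next long.
            blocks[block + 1] = (blocks[block + 1] & ~(mask >>> bitsInFirst))
                    | (value >>> bitsInFirst);
        }
    }
}
```

With nbits=3 in this layout, the value at index 21 starts at bit 63 of the first long and so spans two entries, which is the situation the aligned variants avoid: an aligned64 layout would hold 21 three-bit values per long and leave 1 bit of each long unused. Note also that get() returns long even when bitsPerValue is 8, which is the widening the comment above is concerned about.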