[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804723#action_12804723 ]
Michael McCandless commented on LUCENE-1990:
--------------------------------------------

Good progress!

bq. I think Michaels generated code was meant as a temporary solution, until a handcrafted version was available

Actually that was intended to be a fast impl... the switch should be compiled to a direct lookup (maybe plus a conditional to catch the "default" case even though it will never happen...ugh). But I like your impl with no conditional at all. We should test both.

bq. As to whether to use int or long in the interface unsigned packed int, the only numbers that will probably need to be long in the foreseeable future are docids.

Also the file offsets into the terms dict, and possibly the offsets in RAM into the terms dict character data (UTF8 byte[]). Also, when we do column stride fields, we allow storing values > int. I think we should stick with {{long get(index)}} for now.

Other comments:

* Maybe we should move all of this under oal.util.packed? (packedints? ints?)

* I think we should remove getMaxValue() from the Reader interface?

* Why create the IMPLEMENTATION enum? Why not simply return an [anonymous] instance of Writer?

* Why not store bitsPerValue in the header instead of maxValue? EG maybe my maxValue is 7000, but because I'm using directShort, bitsPerValue is 16. Also, the maxValue should not have to be known at write time -- eg the factory API should let me ask for a direct short writer without declaring the maxValue I will store.

* I wonder if we should add an optional Object getDirectBackingArray(). The packed/aligned impls would return null, but the direct byte/short/int/long impls would return their array. This would allow callers to specialize upstream impls to do the direct array lookup without the cast-to-long (like how FieldComparator now has impls for byte, short, int, long). I suspect for column stride fields, when sorting by an integer field, on a 32bit arch, this would be a perf win. But: let's wait until we have CSFs, and we can test whether there really is a gain here....

* I think we shouldn't put a getWriter on every Reader impl... because it's a one to many mapping? Eg the format written by PackedWriter can be read by direct byte/short/int/long, Packed32/64.

* For starters I don't think we should make reader impls that can read nbits > 31 with an int[] backing array. I think a long[] backing array is fine.

* I don't think we need separate PRIORITY and BLOCK_PREFERENCE? Can't we have a single enum (STORAGE?) with: packed, aligned32, aligned64? "Direct" is really just packed with nbits rounded up to 8, 16, 32, 64.

* Aligned32/64 is very wasteful for certain nbits... I like the idea of "auto" to avoid the risk that the caller picks a bad combination.

* I think for starters we should not make any reader impls that do remapping at load time.

> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
>                 Key: LUCENE-1990
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1990
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1990-te20100122.patch, LUCENE-1990.patch,
>                      LUCENE-1990_PerformanceMeasurements20100104.zip
>
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl. EG the terms dict index in
> the standard codec in LUCENE-1458 could substantially reduce its RAM
> usage. FieldCache.StringIndex could as well. And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs {
>   long get(long index);
>   void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting. If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
> PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is a good existing impl
> that has a compatible license that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!
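
For readers following the thread, here is a minimal sketch of the kind of packed storage being discussed. It is not the code from any of the attached patches: the class name PackedLongsSketch, its methods, the int index, and the long[] layout are all hypothetical, chosen only to illustrate a create(count, maxValue) factory that derives bitsPerValue from maxValue and a {{long get(index)}} accessor. For example, a maxValue of 7000 needs 13 bits per value, where a direct short[] would spend 16.

```java
/** Hypothetical sketch (not the LUCENE-1990 patch): fixed-width unsigned values packed into a long[]. */
public final class PackedLongsSketch {
    private final long[] blocks;     // backing storage, 64 bits per block
    private final int bitsPerValue;  // width of every stored value, 1..64
    private final long mask;         // low bitsPerValue bits set

    /** Smallest number of bits that can represent maxValue (at least 1). */
    public static int bitsRequired(long maxValue) {
        if (maxValue < 0) {
            throw new IllegalArgumentException("maxValue must be >= 0");
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    /** Factory in the spirit of the issue description: width is derived from maxValue. */
    public static PackedLongsSketch create(int count, long maxValue) {
        return new PackedLongsSketch(count, bitsRequired(maxValue));
    }

    /** Alternative entry point: the caller states the width directly, no maxValue needed. */
    public PackedLongsSketch(int count, int bitsPerValue) {
        if (count < 0 || bitsPerValue < 1 || bitsPerValue > 64) {
            throw new IllegalArgumentException("bad count or bitsPerValue");
        }
        this.bitsPerValue = bitsPerValue;
        this.mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
        long totalBits = (long) count * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) >>> 6)];
    }

    public int bitsPerValue() {
        return bitsPerValue;
    }

    public long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        long value = blocks[block] >>> offset;
        int bitsInFirst = 64 - offset;
        if (bitsInFirst < bitsPerValue) {
            // The value straddles two blocks: take the high bits from the next one.
            value |= blocks[block + 1] << bitsInFirst;
        }
        return value & mask;
    }

    public void set(int index, long value) {
        if ((value & ~mask) != 0) {
            throw new IllegalArgumentException("value does not fit in " + bitsPerValue + " bits");
        }
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        // Clear the target bits, then write the low part of the value.
        blocks[block] = (blocks[block] & ~(mask << offset)) | (value << offset);
        int bitsInFirst = 64 - offset;
        if (bitsInFirst < bitsPerValue) {
            // Write the remaining high bits into the start of the next block.
            blocks[block + 1] = (blocks[block + 1] & ~(mask >>> bitsInFirst)) | (value >>> bitsInFirst);
        }
    }
}
```

Having both a maxValue-based factory and a constructor that takes bitsPerValue directly mirrors the point made in the comment: a caller who wants, say, 16 bits per value should be able to ask for that without knowing the largest value up front. This sketch does no remapping and has no conditional-free fast path; it is only meant to make the bit arithmetic concrete.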