[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775420#action_12775420 ]
Fuad Efendi edited comment on LUCENE-1990 at 11/10/09 4:10 PM: --------------------------------------------------------------- Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go "horizontally" we will end up with 8 arrays of int[]. What if we go "vertically"? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only "1". And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) P.S. Of course each solution has pros and cons, I am trying to focus on FieldCache specific use cases. 1. For a given document ID, find a value for a field 2. For a given query results, sort it by a field values 3. For a given query results, count "facet" for each field value I don't think such naive compression is slower than abstract int[] arrays... and we need to change public API of field cache too: if method returns int[] we are not saving any RAM. Better is to compare with SOLR use cases and to make API closer to real requirements; SOLR operates with some bitsets instead of arrays... was (Author: funtick): Specifically for FieldCache, let's see... suppose Field may have 8 different values, and number of documents is high. {code} Value0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 ... Value1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 ... Value2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 ... Value3 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... Value4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ... Value5 0 0 0 0 0 1 0 0 0 0 1 0 1 0 ... Value6 0 0 0 0 0 0 0 1 0 1 0 0 0 0 ... Value7 0 0 0 0 0 0 0 0 1 0 0 0 0 1 ... {code} - represented as Matrix (or as a Vector); for instance, first row means that Document1 and Document8 have Value0. And now, if we go "horizontally" we will end up with 8 arrays of int[]. What if we go "vertically"? Field could be encoded as 3-bit (8 different values). CONSTRAINT: specifically for FieldCache, each Column must have the only "1". And we can end with array of 3-bit values storing position in a column! Size of array is IndexReader.maxDoc(). hope I am reinventing bycycle :) > Add unsigned packed int impls in oal.util > ----------------------------------------- > > Key: LUCENE-1990 > URL: https://issues.apache.org/jira/browse/LUCENE-1990 > Project: Lucene - Java > Issue Type: Improvement > Components: Index > Reporter: Michael McCandless > Priority: Minor > > There are various places in Lucene that could take advantage of an > efficient packed unsigned int/long impl. EG the terms dict index in > the standard codec in LUCENE-1458 could subsantially reduce it's RAM > usage. FieldCache.StringIndex could as well. And I think "load into > RAM" codecs like the one in TestExternalCodecs could use this too. > I'm picturing something very basic like: > {code} > interface PackedUnsignedLongs { > long get(long index); > void set(long index, long value); > } > {code} > Plus maybe an iterator for getting and maybe also for setting. If it > helps, most of the usages of this inside Lucene will be "write once" > so eg the set could make that an assumption/requirement. > And a factory somewhere: > {code} > PackedUnsignedLongs create(int count, long maxValue); > {code} > I think we should simply autogen the code (we can start from the > autogen code in LUCENE-1410), or, if there is an good existing impl > that has a compatible license that'd be great. > I don't have time near-term to do this... so if anyone has the itch, > please jump! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org