[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795228#action_12795228 ]
Toke Eskildsen commented on LUCENE-1990: ---------------------------------------- I have very recently done some experiments with random-access packed positive integer arrays (we could also call them unsigned). The basic approach is to put the bits after each other in an int- or long-array, then extract the value by shifting and masking. Unfortunately I have not been able to determine a single winning strategy for space/speed trade-offs. I've tried four simple implementations: 1. Pack the bits right after each other. 2. Pack the bits right after each other but allow only 1, 2, 4, 8, 16 or 32 as value-bits. 3. Wrap an int[] or long[] and store the values directly (for comparison with 1 & 2). 4. Use int[] directly without any wrapping. (Code can be found at http://summa.svn.sourceforge.net/viewvc/summa/trunk/Common/src/dk/statsbiblioteket/summa/common/util/bits/ where #1 is BitsArrayPacked, #2 is BitsArrayAligned and #3 is BitsArrayInt) The obvious benefit from #2 is that the bits for values are always contained in a single block (int or long), which reduces the amount of operations for set and get considerably. Unfortunately this means that as soon as a value greater than 65535 needs to be stored, the internal representation will require 32 bits/value. This means no space-benefit compared to #3 while the speed penalty remains. A hybrid approach might be considered where the implementation is determined by the number of bits needed to store a value. I've done some performance tests of the implementations on an aging 1.8GHz single-core laptop (Dell 820). The code can be found at http://summa.svn.sourceforge.net/viewvc/summa/trunk/Common/test/dk/statsbiblioteket/summa/common/util/bits/BitsArrayPerformanceTest.java?revision=2038&view=markup In the tables below, the measurements for Packed, Aligned, Int, Constant and int[] are the total time in milliseconds to perform actionCount actions (either read or write). Random numbers (using the same seed) were used for index as well as value and the overhead of constructing the random numbers were subtracted from the measurements. "Constant" is a dummy where no values are stored and the same value is always returned on a get. It is used to measure method-call overhead. For arrays of 10M values, 6-33 million values can be read or written per second. If this is within acceptable limits, I'd be happy to try and make a contribution to Lucene based on the code. However, I probably won't find the time before February or March 2010. {code} actionCount arrayLength actionType valueMax Packed(#1) Aligned(#2) Int(#3) Constant int[](#4) 10000000 1000000 write 1 583 416 984 254 635 10000000 1000000 write 15 594 499 1172 286 843 10000000 1000000 write 255 604 478 1057 213 656 10000000 1000000 write 256 734 765 1062 109 729 10000000 1000000 write 65535 1036 802 1124 417 734 10000000 1000000 write 65536 1015 1130 1072 172 781 10000000 1000000 write 2097151 1020 1187 1052 223 614 10000000 1000000 write 2147483646 1136 1073 839 73 719 10000000 1000000 read 1 286 203 786 104 0 10000000 1000000 read 15 291 182 859 78 0 10000000 1000000 read 255 672 494 989 67 0 10000000 1000000 read 256 568 729 1104 93 0 10000000 1000000 read 65535 833 755 1088 99 0 10000000 1000000 read 65536 947 974 1062 104 0 10000000 1000000 read 2097151 999 963 1062 88 0 10000000 1000000 read 2147483646 1349 869 1260 84 0 10000000 10000000 write 1 833 568 1458 239 1229 10000000 10000000 write 15 1427 1255 1432 276 1615 10000000 10000000 write 255 1599 1578 1448 250 1244 10000000 10000000 write 256 1656 1520 1317 109 1109 10000000 10000000 write 65535 1734 1630 1385 245 1307 10000000 10000000 write 65536 1718 1640 1395 182 1208 10000000 10000000 write 2097151 1807 1781 1447 250 1291 10000000 10000000 write 2147483646 1718 1599 1281 73 1099 10000000 10000000 read 1 562 296 1301 109 0 10000000 10000000 read 15 1187 1198 1322 83 0 10000000 10000000 read 255 1421 1432 1526 99 0 10000000 10000000 read 256 1521 1583 1588 104 0 10000000 10000000 read 65535 1578 1427 1374 31 0 10000000 10000000 read 65536 1634 1499 1489 113 0 10000000 10000000 read 2097151 1625 1374 1468 224 0 10000000 10000000 read 2147483646 1609 1349 1499 83 0 {code} > Add unsigned packed int impls in oal.util > ----------------------------------------- > > Key: LUCENE-1990 > URL: https://issues.apache.org/jira/browse/LUCENE-1990 > Project: Lucene - Java > Issue Type: Improvement > Components: Index > Reporter: Michael McCandless > Priority: Minor > > There are various places in Lucene that could take advantage of an > efficient packed unsigned int/long impl. EG the terms dict index in > the standard codec in LUCENE-1458 could subsantially reduce it's RAM > usage. FieldCache.StringIndex could as well. And I think "load into > RAM" codecs like the one in TestExternalCodecs could use this too. > I'm picturing something very basic like: > {code} > interface PackedUnsignedLongs { > long get(long index); > void set(long index, long value); > } > {code} > Plus maybe an iterator for getting and maybe also for setting. If it > helps, most of the usages of this inside Lucene will be "write once" > so eg the set could make that an assumption/requirement. > And a factory somewhere: > {code} > PackedUnsignedLongs create(int count, long maxValue); > {code} > I think we should simply autogen the code (we can start from the > autogen code in LUCENE-1410), or, if there is an good existing impl > that has a compatible license that'd be great. > I don't have time near-term to do this... so if anyone has the itch, > please jump! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org