[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799701#action_12799701 ]
Michael McCandless commented on LUCENE-1990:
--------------------------------------------

bq. The first section is for 1M values in the structure, the second is for 10M. As the CPU on the test machine (Intel T2400) has only 2MB of level 2 cache, the increased processing time for the seemingly same amount of work is an effect of more cache misses.

Got it.

bq. Caching also accounts for why the packed version is sometimes better than the aligned. For values representable as 9 or 17 bits, the aligned version needs 16 and 32 bits respectively. In the case with 10M values, the packed version uses 1.1MB and 2.1MB for 9 and 17 bits respectively, while the aligned uses 2MB and 4MB respectively. The simpler logic of the aligned version does not compensate enough for the higher number of trips to main memory.

OK, though I think we should somehow exclude these CPU cache effects from the test. In a real production situation, lots of other things are competing for that cache, so I think the mostly-cache-miss case is the most realistic here. Actually, a good way to do that is to test it in a wider context, e.g. once I can get LUCENE-2186 "integrated", we can test sorting by an int field with these different ways of encoding.

{quote}
I did not generate any specialized code for the aligned case: no matter the number of bits/value, the number of shifts, masks and ors is always the same. If the number of bits/value is known beforehand, specialized cases should be used (I made a factory that selects between packed, aligned and direct (#3), depending on the number of bits/value).
{quote}

OK, I have a start at this under LUCENE-2186. I'll try to clean it up & post it here...

{quote}
The reason for not doing so in the first place is that I wanted to let the structure auto-adjust the bits/value when a new value was added. Having different implementations encapsulated in the same class means another level of indirection or conditionals, both of which I wanted to avoid for performance reasons. That being said, I haven't tested how much of a penalty this would be. The standard use case seems to be some sort of update round, after which no updates are performed. Having a cleanupAndOptimize call that potentially creates a new and optimized structure would fit well into this and would avoid the indirection/conditional penalty.
{quote}

I think the writing API can be a write-once API, where I declare the maxValue I will write. This fits with the indexing chain, where we buffer values in RAM and then flush to the segment files. This way we don't need the complexity of having to resize the bits internally when a new max value is seen. So, the factory would take the number of values we will write, the max value, and I guess an enum for Packed/Aligned/DedicatedArray, and return a simple writer that just exposes an "add(long value)" method?

bq. A whole other matter is longs vs. ints. I've tried using longs instead of ints as the backing array and the penalty on my 32-bit processor was very high (I need to make some tests on this). If it must be possible to set and get longs, it's hard to avoid using long[] as the internal structure, but if ints are accepted as the only valid values, selecting long[] as the backing array for 64-bit machines and int[] for 32-bit might be the solution.

Hmm... I guess we could still support values requiring more than 32 bits, encoded into an int[], but it's more hairy as the value could span 2 or 3 ints. Probably we should specialize/default the factory selection based on whether the JVM is 32/64 bit? And on whether the bit size is <= 32.
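Something like this minimal sketch, perhaps. All names here (PackedWriter, bitsRequired, etc.) are just illustrations, not the actual API from either patch, and it only shows a packed long[]-backed variant; a real factory would also branch on the Packed/Aligned/DedicatedArray enum and on the 32/64-bit JVM question above:

{code}
// Hypothetical write-once packed writer: the factory is told up front
// how many values will be written and the largest value, so bits/value
// is fixed and never needs resizing while writing.
public final class PackedWriter {
  private final long[] blocks;   // backing store; long[] vs int[] could
                                 // be chosen per 32/64-bit JVM
  private final int bitsPerValue;
  private long bitPos = 0;       // absolute bit position of the next value

  public static PackedWriter create(int valueCount, long maxValue) {
    int bits = Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    return new PackedWriter(valueCount, bits);
  }

  private PackedWriter(int valueCount, int bitsPerValue) {
    this.bitsPerValue = bitsPerValue;
    this.blocks = new long[(int) (((long) valueCount * bitsPerValue + 63) / 64)];
  }

  // Assumes 0 <= value <= maxValue (no masking done here).
  public void add(long value) {
    int block = (int) (bitPos >>> 6);   // which long
    int offset = (int) (bitPos & 63);   // bit offset within it
    blocks[block] |= value << offset;
    if (offset + bitsPerValue > 64) {   // value spans two blocks
      blocks[block + 1] |= value >>> (64 - offset);
    }
    bitPos += bitsPerValue;
  }
}
{code}

With the count and max value known up front, add() needs no resizing or auto-adjusting logic at all.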
bq. All this calls for a factory approach to hide the fairly complex task of choosing the right implementation.

Yes.

{quote}
I made some small tweaks to improve performance and added long[]-backed versions of Packed (optimal space) and Aligned (no values span underlying blocks), then ran the performance tests on 5 different computers. It seems very clear that level 2 cache (and presumably RAM speed, but I do not know how to determine that without root access on a Linux box) plays a bigger role for access speed than mere CPU speed. One 3GHz machine with 1MB of level 2 cache was about half as fast as a 1.8GHz laptop with 2MB of level 2 cache.

There are a whole lot of measurements and it is getting hard to digest them. I've attached logs from the 5 computers, should anyone want to have a look. Some observations are:

1. The penalty of using long[] instead of int[] on my 32-bit laptop depends on the number of values in the array. For less than a million values, it is severe: the long[] version is 30-60% slower, depending on whether packed or aligned values are used. Above that, it was 10% slower for Aligned and 25% slower for Packed. On the other hand, 64-bit machines do not seem to care much whether int[] or long[] is used: there was a 10% win for arrays below 1M on one machine, a 50% win for arrays below 100K (8% for 1M, 6% for 10M) on another, and a small loss of below 1% for all lengths above 10K on a third.

2. There's a fast drop-off in speed when the array reaches a certain size, correlated with level 2 cache size. After that, the speed does not decrease much as the array grows. This also affects direct writes to an int[] and has the interesting implication that a packed array out-performs the direct-access approach for writes in a number of cases. For reads, there's no contest: direct access to int[] is blazingly fast.

3. The access speed of the different implementations converges as the number of values in the array rises (think 10M+ values): the slow round trip to main memory dwarfs the logic used for value extraction.

Observation #3 supports Mike McCandless's choice of going for the packed approach, and #1 suggests using int[] as the internal structure for now. Using int[] as the internal structure makes it unfeasible to accept longs as input (or rather: longs with more than 32 significant bits). I don't know if this is acceptable?
{quote}

I think we should agree on an API that LUCENE-2186 can use to write/read packed ints, then get our two patches talking. I'll pull out the barebones packed ints I'm currently using and post them here... then let's merge/iterate to a common API, so I can cut over to the patch from here in LUCENE-2186?
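For reference, a rough sketch of what the read side over an int[] backing array could look like, and why the "value could span 2 or 3 ints" point above matters. The class and field names are made up for illustration and are not from either patch:

{code}
// Hypothetical packed reader over int[], for bitsPerValue <= 32 only.
public final class PackedReader {
  private final int[] blocks;     // assumed over-allocated by one int so
                                  // the two-block read below never overruns
  private final int bitsPerValue;
  private final long mask;

  public PackedReader(int[] blocks, int bitsPerValue) {
    this.blocks = blocks;
    this.bitsPerValue = bitsPerValue;
    this.mask = (1L << bitsPerValue) - 1;
  }

  public long get(long index) {
    long bitPos = index * bitsPerValue;
    int block = (int) (bitPos >>> 5);   // which int (32 bits each)
    int offset = (int) (bitPos & 31);   // bit offset within that int
    // With bitsPerValue <= 32 and offset <= 31, a 64-bit window starting
    // at 'block' always covers the value, even when it spans two ints.
    long word = (blocks[block] & 0xFFFFFFFFL)
              | ((long) blocks[block + 1] << 32);
    return (word >>> offset) & mask;
  }
}
{code}

Allowing more than 32 bits/value over an int[] would mean a single value can span three ints, which is exactly the extra hairiness mentioned above.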
> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
>                 Key: LUCENE-1990
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1990
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1990_PerformanceMeasurements20100104.zip
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl. E.g. the terms dict index in
> the standard codec in LUCENE-1458 could substantially reduce its RAM
> usage. FieldCache.StringIndex could as well. And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs {
>   long get(long index);
>   void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting. If it
> helps, most of the usages of this inside Lucene will be "write once",
> so e.g. the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
> PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is a good existing impl
> with a compatible license, that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!