[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799701#action_12799701 ]
Michael McCandless commented on LUCENE-1990:
--------------------------------------------

bq. The first section is for 1M values in the structure, the second is for 10M. As the CPU on the test machine (Intel T2400) has only 2MB of level 2 cache, the increased processing time for the seemingly same amount of work is an effect of more cache misses.

Got it.

bq. Caching also accounts for why the packed version is sometimes better than the aligned. For values representable as 9 or 17 bits, the aligned version needs 16 and 32 bits respectively. In the case with 10M values, the packed version uses 1.1MB and 2.1MB for 9 and 17 bits respectively, while the aligned uses 2MB and 4MB respectively. The simpler logic of the aligned version does not compensate enough for the higher number of trips to main memory.

OK, though I think we should somehow exclude these CPU cache effects from the test. In a real production situation, lots of other things are competing for that cache, so I think the mostly-cache-miss case is the most realistic here. Actually, a good way to do that is to test it in a wider context, e.g. once I can get LUCENE-2186 "integrated", we can test sorting by an int field with these different ways of encoding.

{quote}
I did not generate any specialized code for the aligned case: no matter the number of bits/value, the number of shifts, masks and ors is always the same. If the number of bits/value is known beforehand, specialized cases should be used (I made a factory that selects between packed, aligned and direct (#3), depending on the number of bits/value).
{quote}

OK, I have a start at this under LUCENE-2186. I'll try to clean it up & post it here...

{quote}
The reason for not doing so in the first place is that I wanted to let the structure auto-adjust the bits/value when a new value was added. Having different implementations encapsulated in the same class means another level of indirection or conditionals, both of which I wanted to avoid for performance reasons. That being said, I haven't tested how much of a penalty this would be. The standard use case seems to be some sort of update round, after which no updates are performed. Having a cleanupAndOptimize call that potentially creates a new and optimized structure would fit well into this and would avoid the indirection/conditional penalty.
{quote}

I think the writing API can be a write-once API, where I declare the maxValue I will write. This fits with the indexing chain, where we buffer values in RAM and then flush to the segment files. This way we don't need the complexity of having to resize the bits internally when a new max value is seen. So, the factory would take the number of values we will write, the max value, and I guess an enum for Packed/Aligned/DedicatedArray, and return a simple writer that just exposes an "add(long value)" method?

bq. A whole other matter is longs vs. ints. I've tried using longs instead of ints as the backing array and the penalty on my 32-bit processor was very high (I need to make some tests on this). If it must be possible to set and get longs, it's hard to avoid using long[] as the internal structure, but if ints are accepted as the only valid values, selecting long[] as the backing array for 64-bit machines and int[] for 32-bit might be the solution.

Hmm... I guess we could still support values requiring more than 32 bits, encoded into an int[], but it's more hairy as the value could span 2 or 3 ints. Probably we should specialize/default the factory selection based on whether the JVM is 32/64 bit? And on whether the bit size is <= 32.
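Something like this minimal sketch, perhaps. All names here (PackedWriter, bitsRequired, etc.) are just illustrations, not the actual API from either patch, and it only shows a packed long[]-backed variant; a real factory would also branch on the Packed/Aligned/DedicatedArray enum and on the 32/64-bit JVM question above:

{code}
// Hypothetical write-once packed writer: the factory is told up front
// how many values will be written and the largest value, so bits/value
// is fixed and never needs resizing while writing.
public final class PackedWriter {
  private final long[] blocks;   // backing store; long[] vs int[] could
                                 // be chosen per 32/64-bit JVM
  private final int bitsPerValue;
  private long bitPos = 0;       // absolute bit position of the next value

  public static PackedWriter create(int valueCount, long maxValue) {
    int bits = Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    return new PackedWriter(valueCount, bits);
  }

  private PackedWriter(int valueCount, int bitsPerValue) {
    this.bitsPerValue = bitsPerValue;
    this.blocks = new long[(int) (((long) valueCount * bitsPerValue + 63) / 64)];
  }

  // Assumes 0 <= value <= maxValue (no masking done here).
  public void add(long value) {
    int block = (int) (bitPos >>> 6);   // which long
    int offset = (int) (bitPos & 63);   // bit offset within it
    blocks[block] |= value << offset;
    if (offset + bitsPerValue > 64) {   // value spans two blocks
      blocks[block + 1] |= value >>> (64 - offset);
    }
    bitPos += bitsPerValue;
  }
}
{code}

With the count and max value known up front, add() needs no resizing or auto-adjusting logic at all.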
bq. All this calls for a factory approach to hide the fairly complex task of choosing the right implementation.

Yes.

{quote}
I made some small tweaks to improve performance and added long[]-backed versions of Packed (optimal space) and Aligned (no values span underlying blocks), then ran the performance tests on 5 different computers. It seems very clear that level 2 cache (and presumably RAM speed, but I do not know how to determine that without root access on a Linux box) plays a bigger role for access speed than mere CPU speed. One 3GHz machine with 1MB of level 2 cache was about half as fast as a 1.8GHz laptop with 2MB of level 2 cache.

There are a whole lot of measurements and it is getting hard to digest them. I've attached logs from the 5 computers, should anyone want to have a look. Some observations are:

1. The penalty of using long[] instead of int[] on my 32-bit laptop depends on the number of values in the array. For less than a million values, it is severe: the long[] version is 30-60% slower, depending on whether packed or aligned values are used. Above that, it was 10% slower for Aligned and 25% slower for Packed. On the other hand, 64-bit machines do not seem to care much whether int[] or long[] is used: there was a 10% win for arrays below 1M on one machine, a 50% win for arrays below 100K (8% for 1M, 6% for 10M) on another, and a small loss of below 1% for all lengths above 10K on a third.

2. There's a fast drop-off in speed when the array reaches a certain size, correlated with level 2 cache size. After that, the speed does not decrease much as the array grows. This also affects direct writes to an int[] and has the interesting implication that a packed array out-performs the direct-access approach for writes in a number of cases. For reads, there's no contest: direct access to int[] is blazingly fast.

3. The access speed of the different implementations converges as the number of values in the array rises (think 10M+ values): the slow round trip to main memory dwarfs the logic used for value extraction.

Observation #3 supports Mike McCandless's choice of going for the packed approach, and #1 suggests using int[] as the internal structure for now. Using int[] as the internal structure makes it unfeasible to accept longs as input (or rather: longs with more than 32 significant bits). I don't know if this is acceptable?
{quote}

I think we should agree on an API that LUCENE-2186 can use to write/read packed ints, then get our two patches talking. I'll pull out the barebones packed ints I'm currently using and post them here... then let's merge/iterate to a common API, so I can cut over to the patch from here in LUCENE-2186?
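For reference, a rough sketch of what the read side over an int[] backing array could look like, and why the "value could span 2 or 3 ints" point above matters. The class and field names are made up for illustration and are not from either patch:

{code}
// Hypothetical packed reader over int[], for bitsPerValue <= 32 only.
public final class PackedReader {
  private final int[] blocks;     // assumed over-allocated by one int so
                                  // the two-block read below never overruns
  private final int bitsPerValue;
  private final long mask;

  public PackedReader(int[] blocks, int bitsPerValue) {
    this.blocks = blocks;
    this.bitsPerValue = bitsPerValue;
    this.mask = (1L << bitsPerValue) - 1;
  }

  public long get(long index) {
    long bitPos = index * bitsPerValue;
    int block = (int) (bitPos >>> 5);   // which int (32 bits each)
    int offset = (int) (bitPos & 31);   // bit offset within that int
    // With bitsPerValue <= 32 and offset <= 31, a 64-bit window starting
    // at 'block' always covers the value, even when it spans two ints.
    long word = (blocks[block] & 0xFFFFFFFFL)
              | ((long) blocks[block + 1] << 32);
    return (word >>> offset) & mask;
  }
}
{code}

Allowing more than 32 bits/value over an int[] would mean a single value can span three ints, which is exactly the extra hairiness mentioned above.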
> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
>                 Key: LUCENE-1990
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1990
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1990_PerformanceMeasurements20100104.zip
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl. E.g. the terms dict index in
> the standard codec in LUCENE-1458 could substantially reduce its RAM
> usage. FieldCache.StringIndex could as well. And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs {
>   long get(long index);
>   void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting. If it
> helps, most of the usages of this inside Lucene will be "write once",
> so e.g. the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
> PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is a good existing impl
> with a compatible license, that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!