[ https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404234#comment-13404234 ]
Toke Eskildsen commented on LUCENE-4062:
----------------------------------------

+1 for replacing the Packed64SingleBlock with the optimized version; it is consistently better. I have updated the charts at http://ekot.dk/misc/packedints/ with i7 measurements. For the three non-i7-based Intel processors, Packed64SingleBlock seems the clear-cut choice for "best possible performance without going all-out with Direct".

I tried running two tests in parallel on te-prime (see http://ekot.dk/misc/packedints/te-prime_parallel.html) and got very consistent results:

* For set, the contiguous strategy is faster than or equal to padded for bpvs 3-12 and 16-21 (with bpv 1 and 2 being hard to measure). Since padded only supports bpv 1-10, 12, 16, 21 & 32, with non-conforming bpvs rounded up, this effectively means that there is zero gain in using padded over contiguous with regard to set on that machine.
* For get, the picture is the same, except for process 2, where bpv 17-21 was slower for contiguous than for padded. Sigh. I should of course have started 3 processes so that voting would have been easier.

The same pattern, although noisier, can be seen on mars. My preliminary conclusion for i7 processors is thus that Packed64Strategy is the right choice for "best possible performance without going all-out with Direct". I am getting very curious about the untested AMD architecture now.

A by-product of the te-prime parallel test is that the amount of cache seems to matter little when it comes to selecting the most fitting implementation. Thank $deity for small favors.
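For readers following the contiguous-vs-padded comparison above: the key difference is that a contiguous layout packs values back-to-back across long boundaries, so a get may have to read two blocks, while the padded (single-block) layout never crosses a boundary. A minimal sketch of a contiguous get, with a hypothetical class name (this is an illustration of the idea, not the benchmarked code):

```java
public class ContiguousGetSketch {
    // Read value number `index` from a contiguous bit-packed long[] in which
    // values are stored back-to-back, least-significant bits first.
    // Assumes 0 < bitsPerValue < 64.
    static long contiguousGet(long[] blocks, int bitsPerValue, int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);  // bitPos / 64
        int offset = (int) (bitPos & 63);  // bitPos % 64
        long mask = (1L << bitsPerValue) - 1;
        if (offset + bitsPerValue <= 64) {
            // The whole value fits in one block: a single read suffices.
            return (blocks[block] >>> offset) & mask;
        }
        // The value straddles a block boundary: low bits come from this block,
        // high bits from the next one. This second read is what the padded
        // layout avoids.
        long lo = blocks[block] >>> offset;
        long hi = blocks[block + 1] << (64 - offset);
        return (lo | hi) & mask;
    }
}
```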
> More fine-grained control over the packed integer implementation that is chosen
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-4062
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4062
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/other
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>              Labels: performance
>             Fix For: 4.0, 5.0
>
>         Attachments: LUCENE-4062-2.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, Packed64SingleBlock.java, Packed64Strategy.java, Packed64calc.java, PackedIntsBenchmark.java, PackedIntsBenchmark.java, PackedIntsBenchmark.java, measurements_te_graphs.pdf, measurements_te_i7.txt, measurements_te_p4.txt, measurements_te_xeon.txt
>
> In order to save space, Lucene has two main PackedInts.Mutable implementations: one that is very fast and based on a byte/short/int/long array (Direct*), and another that packs bits in a memory-efficient manner (Packed*).
>
> The packed implementation tends to be much slower than the direct one, which discourages some Lucene components from using it. On the other hand, if you store 21-bit integers in a Direct32, this is a space loss of (32-21)/32 ≈ 34%.
>
> If you accept trading some space for speed, you could store 3 of these 21-bit integers in a long, resulting in an overhead of 1/3 bit per value. One advantage of this approach is that you never need to read more than one block to read or write a value, so it can be significantly faster than Packed32 and Packed64, which always need to read/write two blocks in order to avoid costly branches.
>
> I ran some tests: for 10000000 21-bit values, this implementation takes less than 2% more space and has 44% faster writes and 30% faster reads.
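The 3-values-per-long scheme described above can be sketched as follows. This is a hypothetical illustration of the layout for bitsPerValue = 21, not the attached Packed64SingleBlock.java:

```java
// Sketch: each 64-bit block holds floor(64/21) = 3 values, wasting
// 64 - 3*21 = 1 bit per block, i.e. 1/3 bit of overhead per value.
// Every get/set touches exactly one block.
class Packed64SingleBlock21Sketch {
    private static final int VALUES_PER_BLOCK = 3;
    private static final long MASK = (1L << 21) - 1;
    private final long[] blocks;

    Packed64SingleBlock21Sketch(int valueCount) {
        blocks = new long[(valueCount + VALUES_PER_BLOCK - 1) / VALUES_PER_BLOCK];
    }

    long get(int index) {
        int block = index / VALUES_PER_BLOCK;
        int shift = (index % VALUES_PER_BLOCK) * 21;
        return (blocks[block] >>> shift) & MASK;  // one block read, no branch
    }

    void set(int index, long value) {
        int block = index / VALUES_PER_BLOCK;
        int shift = (index % VALUES_PER_BLOCK) * 21;
        // Clear the 21-bit slot, then write the new value into it.
        blocks[block] = (blocks[block] & ~(MASK << shift)) | ((value & MASK) << shift);
    }
}
```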
> The 12-bit version (5 values per block) has the same performance improvement and a 6% memory overhead compared to the packed implementation.
>
> In order to select the best implementation for a given integer size, I wrote the {{PackedInts.getMutable(valueCount, bitsPerValue, acceptableOverheadPerValue)}} method. This method selects the fastest implementation that has less than {{acceptableOverheadPerValue}} wasted bits per value. For example, if you accept an overhead of 20% ({{acceptableOverheadPerValue = 0.2f * bitsPerValue}}), which is pretty reasonable, here is what implementations would be selected:
> * 1: Packed64SingleBlock1
> * 2: Packed64SingleBlock2
> * 3: Packed64SingleBlock3
> * 4: Packed64SingleBlock4
> * 5: Packed64SingleBlock5
> * 6: Packed64SingleBlock6
> * 7: Direct8
> * 8: Direct8
> * 9: Packed64SingleBlock9
> * 10: Packed64SingleBlock10
> * 11: Packed64SingleBlock12
> * 12: Packed64SingleBlock12
> * 13: Packed64
> * 14: Direct16
> * 15: Direct16
> * 16: Direct16
> * 17: Packed64
> * 18: Packed64SingleBlock21
> * 19: Packed64SingleBlock21
> * 20: Packed64SingleBlock21
> * 21: Packed64SingleBlock21
> * 22: Packed64
> * 23: Packed64
> * 24: Packed64
> * 25: Packed64
> * 26: Packed64
> * 27: Direct32
> * 28: Direct32
> * 29: Direct32
> * 30: Direct32
> * 31: Direct32
> * 32: Direct32
> * 33: Packed64
> * 34: Packed64
> * 35: Packed64
> * 36: Packed64
> * 37: Packed64
> * 38: Packed64
> * 39: Packed64
> * 40: Packed64
> * 41: Packed64
> * 42: Packed64
> * 43: Packed64
> * 44: Packed64
> * 45: Packed64
> * 46: Packed64
> * 47: Packed64
> * 48: Packed64
> * 49: Packed64
> * 50: Packed64
> * 51: Packed64
> * 52: Packed64
> * 53: Packed64
> * 54: Direct64
> * 55: Direct64
> * 56: Direct64
> * 57: Direct64
> * 58: Direct64
> * 59: Direct64
> * 60: Direct64
> * 61: Direct64
> * 62: Direct64
>
> Under 32 bits per value, only 13, 17 and 22-26 bits per value would still choose the slower Packed64 implementation.
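The selection rule behind the table above can be sketched like this. It is a hypothetical reconstruction, not the patch's actual code: prefer Direct*, then a single-block layout (assuming the supported widths 1-10, 12, 16, 21 and 32 mentioned earlier), and fall back to fully packed Packed64 when neither fits the overhead budget:

```java
class SelectionSketch {
    // Pick the fastest layout whose wasted bits per value stay within
    // acceptableOverheadPerValue (a number of bits, e.g. 0.2f * bitsPerValue).
    static String select(int bitsPerValue, float acceptableOverheadPerValue) {
        float maxBits = bitsPerValue + acceptableOverheadPerValue;
        // Fastest: Direct* stores each value in a whole 8/16/32/64-bit slot.
        for (int directBits : new int[] {8, 16, 32, 64}) {
            if (directBits >= bitsPerValue && directBits <= maxBits) {
                return "Direct" + directBits;
            }
        }
        // Next: single-block layouts; non-conforming bpvs are rounded up to
        // the nearest supported width.
        for (int blockBits : new int[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, 21, 32}) {
            if (blockBits >= bitsPerValue && blockBits <= maxBits) {
                return "Packed64SingleBlock" + blockBits;
            }
        }
        return "Packed64"; // fully packed: most compact, slowest
    }
}
```

With a 20% budget this reproduces the table's choices, e.g. 7 bits falls into Direct8 (8 <= 8.4) while 13 bits fits nothing short of Packed64 (15.6 is below both 16-bit options).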
> Allowing a 50% overhead would prevent the packed implementation from being selected for bits per value under 32. Allowing an overhead of 32 bits per value would make sure that a Direct* implementation is always selected.
>
> Next steps would be to:
> * make Lucene components use this {{getMutable}} method and let users decide what trade-off better suits them,
> * write a Packed32SingleBlock implementation if necessary (I didn't do it because I have no 32-bit computer to test the performance improvements).
>
> I think this would allow more fine-grained control over the speed/space trade-off. What do you think?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org