[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973691#comment-16973691
 ] 

Adrien Grand commented on LUCENE-9027:
--------------------------------------

Following a suggestion by [~rcmuir] I specialized the patch for little-endian 
architectures. Big-endian architectures will get a slowdown due to the need to 
do a Long#reverseBytes on all read longs, but this probably doesn't matter much 
and helps simplify DataInput.

> SIMD-based decoding of postings lists
> -------------------------------------
>
>                 Key: LUCENE-9027
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9027
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>    a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>    current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>    recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>    compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>    2 ints in a long in order to decompress multiple values at once.
> Benchmark                                       (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               1           LE  
> thrpt    5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               1           BE  
> thrpt    5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               2           LE  
> thrpt    5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               2           BE  
> thrpt    5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               3           LE  
> thrpt    5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               3           BE  
> thrpt    5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               4           LE  
> thrpt    5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               4           BE  
> thrpt    5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               5           LE  
> thrpt    5  10.988 ± 1.179  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               5           BE  
> thrpt    5  11.226 ± 0.088  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               6           LE  
> thrpt    5   9.791 ± 0.305  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               6           BE  
> thrpt    5   9.403 ± 3.598  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               7           LE  
> thrpt    5  10.256 ± 0.211  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               7           BE  
> thrpt    5  10.314 ± 0.382  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               8           LE  
> thrpt    5  16.516 ± 0.380  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               8           BE  
> thrpt    5  16.375 ± 0.427  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               9           LE  
> thrpt    5   9.067 ± 0.066  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               9           BE  
> thrpt    5   9.078 ± 0.178  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              10           LE  
> thrpt    5   8.913 ± 0.074  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              10           BE  
> thrpt    5   8.893 ± 0.101  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              11           LE  
> thrpt    5   7.908 ± 0.118  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              11           BE  
> thrpt    5   7.864 ± 0.097  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              12           LE  
> thrpt    5   9.220 ± 0.103  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              12           BE  
> thrpt    5   9.186 ± 0.241  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              13           LE  
> thrpt    5   7.119 ± 0.071  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              13           BE  
> thrpt    5   7.066 ± 0.059  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              14           LE  
> thrpt    5  12.483 ± 0.171  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              14           BE  
> thrpt    5  12.473 ± 0.117  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              15           LE  
> thrpt    5   6.202 ± 0.192  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              15           BE  
> thrpt    5   6.187 ± 0.399  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              16           LE  
> thrpt    5  12.798 ± 0.249  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              16           BE  
> thrpt    5  12.987 ± 0.208  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               1           LE  
> thrpt    5   7.248 ± 0.096  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               1           BE  
> thrpt    5   7.292 ± 0.114  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               2           LE  
> thrpt    5   8.923 ± 0.099  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               2           BE  
> thrpt    5   8.899 ± 0.028  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               3           LE  
> thrpt    5   9.192 ± 0.082  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               3           BE  
> thrpt    5   9.090 ± 0.066  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               4           LE  
> thrpt    5   7.947 ± 0.039  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               4           BE  
> thrpt    5   7.809 ± 0.298  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               5           LE  
> thrpt    5   8.342 ± 0.568  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               5           BE  
> thrpt    5   8.259 ± 0.572  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               6           LE  
> thrpt    5  15.594 ± 0.149  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               6           BE  
> thrpt    5  14.012 ± 0.160  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               7           LE  
> thrpt    5  12.686 ± 0.271  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               7           BE  
> thrpt    5  12.806 ± 0.160  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               8           LE  
> thrpt    5  13.571 ± 0.135  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               8           BE  
> thrpt    5  13.312 ± 0.110  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               9           LE  
> thrpt    5  11.812 ± 0.108  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               9           BE  
> thrpt    5  12.874 ± 0.168  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              10           LE  
> thrpt    5  12.882 ± 0.114  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              10           BE  
> thrpt    5  12.142 ± 0.091  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              11           LE  
> thrpt    5  12.302 ± 0.111  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              11           BE  
> thrpt    5  10.762 ± 0.250  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              12           LE  
> thrpt    5  12.505 ± 0.070  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              12           BE  
> thrpt    5  12.149 ± 0.083  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              13           LE  
> thrpt    5  11.159 ± 0.341  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              13           BE  
> thrpt    5  10.395 ± 0.222  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              14           LE  
> thrpt    5  11.004 ± 0.058  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              14           BE  
> thrpt    5  10.312 ± 0.369  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              15           LE  
> thrpt    5  11.236 ± 0.117  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              15           BE  
> thrpt    5   9.792 ± 0.202  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              16           LE  
> thrpt    5  10.607 ± 0.105  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              16           BE  
> thrpt    5  10.340 ± 0.070  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         1           LE  
> thrpt    5  20.925 ± 0.368  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         1           BE  
> thrpt    5  13.396 ± 0.485  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         2           LE  
> thrpt    5  20.628 ± 0.494  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         2           BE  
> thrpt    5  13.584 ± 0.194  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         3           LE  
> thrpt    5  19.932 ± 1.609  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         3           BE  
> thrpt    5  13.296 ± 0.095  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         4           LE  
> thrpt    5  21.065 ± 0.767  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         4           BE  
> thrpt    5  13.557 ± 0.051  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         5           LE  
> thrpt    5  19.630 ± 0.067  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         5           BE  
> thrpt    5  12.916 ± 0.186  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         6           LE  
> thrpt    5  20.253 ± 0.701  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         6           BE  
> thrpt    5  12.820 ± 0.048  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         7           LE  
> thrpt    5  18.944 ± 0.160  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         7           BE  
> thrpt    5  12.562 ± 0.128  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         8           LE  
> thrpt    5  22.778 ± 2.023  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         8           BE  
> thrpt    5  13.658 ± 0.158  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         9           LE  
> thrpt    5  18.527 ± 0.169  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         9           BE  
> thrpt    5  12.045 ± 0.111  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        10           LE  
> thrpt    5  16.610 ± 0.997  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        10           BE  
> thrpt    5  11.208 ± 0.087  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        11           LE  
> thrpt    5  17.961 ± 0.080  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        11           BE  
> thrpt    5  11.594 ± 0.084  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        12           LE  
> thrpt    5  16.980 ± 2.372  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        12           BE  
> thrpt    5  11.135 ± 0.050  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        13           LE  
> thrpt    5  17.592 ± 0.269  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        13           BE  
> thrpt    5  11.132 ± 0.227  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        14           LE  
> thrpt    5  16.964 ± 0.423  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        14           BE  
> thrpt    5  10.953 ± 0.326  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        15           LE  
> thrpt    5  17.972 ± 0.572  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        15           BE  
> thrpt    5  10.872 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        16           LE  
> thrpt    5  24.152 ± 0.213  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        16           BE  
> thrpt    5  12.984 ± 0.348  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   1           LE  
> thrpt    5  14.567 ± 0.714  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   1           BE  
> thrpt    5  10.541 ± 0.079  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   2           LE  
> thrpt    5  15.395 ± 0.687  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   2           BE  
> thrpt    5  11.142 ± 0.052  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   3           LE  
> thrpt    5  15.802 ± 0.623  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   3           BE  
> thrpt    5  10.656 ± 0.278  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   4           LE  
> thrpt    5  17.732 ± 0.276  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   4           BE  
> thrpt    5  11.289 ± 0.209  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   5           LE  
> thrpt    5  16.230 ± 0.389  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   5           BE  
> thrpt    5  10.216 ± 0.184  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   6           LE  
> thrpt    5  16.478 ± 0.682  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   6           BE  
> thrpt    5  10.379 ± 0.157  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   8           LE  
> thrpt    5  18.222 ± 0.388  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   8           BE  
> thrpt    5  11.153 ± 0.619  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                  10           LE  
> thrpt    5  15.138 ± 0.321  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                  10           BE  
> thrpt    5   9.384 ± 0.671  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                  16           LE  
> thrpt    5  20.776 ± 0.397  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                  16           BE  
> thrpt    5  10.199 ± 0.146  ops/us
> PackedIntsDecodeBenchmark.readLongs                          1           LE  
> thrpt    5  30.220 ± 0.652  ops/us
> PackedIntsDecodeBenchmark.readLongs                          1           BE  
> thrpt    5  16.324 ± 0.226  ops/us
> PackedIntsDecodeBenchmark.readLongs                          2           LE  
> thrpt    5  30.952 ± 0.329  ops/us
> PackedIntsDecodeBenchmark.readLongs                          2           BE  
> thrpt    5  16.492 ± 0.397  ops/us
> PackedIntsDecodeBenchmark.readLongs                          3           LE  
> thrpt    5  30.156 ± 0.979  ops/us
> PackedIntsDecodeBenchmark.readLongs                          3           BE  
> thrpt    5  16.273 ± 0.441  ops/us
> PackedIntsDecodeBenchmark.readLongs                          4           LE  
> thrpt    5  29.925 ± 0.718  ops/us
> PackedIntsDecodeBenchmark.readLongs                          4           BE  
> thrpt    5  15.930 ± 0.350  ops/us
> PackedIntsDecodeBenchmark.readLongs                          5           LE  
> thrpt    5  29.773 ± 0.979  ops/us
> PackedIntsDecodeBenchmark.readLongs                          5           BE  
> thrpt    5  15.775 ± 0.257  ops/us
> PackedIntsDecodeBenchmark.readLongs                          6           LE  
> thrpt    5  29.591 ± 1.285  ops/us
> PackedIntsDecodeBenchmark.readLongs                          6           BE  
> thrpt    5  15.732 ± 0.226  ops/us
> PackedIntsDecodeBenchmark.readLongs                          7           LE  
> thrpt    5  29.708 ± 0.909  ops/us
> PackedIntsDecodeBenchmark.readLongs                          7           BE  
> thrpt    5  15.433 ± 0.562  ops/us
> PackedIntsDecodeBenchmark.readLongs                          8           LE  
> thrpt    5  29.828 ± 0.689  ops/us
> PackedIntsDecodeBenchmark.readLongs                          8           BE  
> thrpt    5  15.390 ± 0.188  ops/us
> PackedIntsDecodeBenchmark.readLongs                          9           LE  
> thrpt    5  29.127 ± 0.309  ops/us
> PackedIntsDecodeBenchmark.readLongs                          9           BE  
> thrpt    5  15.180 ± 0.199  ops/us
> PackedIntsDecodeBenchmark.readLongs                         10           LE  
> thrpt    5  29.085 ± 0.538  ops/us
> PackedIntsDecodeBenchmark.readLongs                         10           BE  
> thrpt    5  14.887 ± 1.687  ops/us
> PackedIntsDecodeBenchmark.readLongs                         11           LE  
> thrpt    5  28.904 ± 0.329  ops/us
> PackedIntsDecodeBenchmark.readLongs                         11           BE  
> thrpt    5  14.936 ± 0.119  ops/us
> PackedIntsDecodeBenchmark.readLongs                         12           LE  
> thrpt    5  29.025 ± 0.299  ops/us
> PackedIntsDecodeBenchmark.readLongs                         12           BE  
> thrpt    5  14.685 ± 0.154  ops/us
> PackedIntsDecodeBenchmark.readLongs                         13           LE  
> thrpt    5  28.963 ± 0.244  ops/us
> PackedIntsDecodeBenchmark.readLongs                         13           BE  
> thrpt    5  14.569 ± 0.100  ops/us
> PackedIntsDecodeBenchmark.readLongs                         14           LE  
> thrpt    5  28.584 ± 1.409  ops/us
> PackedIntsDecodeBenchmark.readLongs                         14           BE  
> thrpt    5  14.340 ± 0.594  ops/us
> PackedIntsDecodeBenchmark.readLongs                         15           LE  
> thrpt    5  28.744 ± 0.314  ops/us
> PackedIntsDecodeBenchmark.readLongs                         15           BE  
> thrpt    5  14.222 ± 0.105  ops/us
> PackedIntsDecodeBenchmark.readLongs                         16           LE  
> thrpt    5  26.638 ± 0.452  ops/us
> PackedIntsDecodeBenchmark.readLongs                         16           BE  
> thrpt    5  13.906 ± 0.604  ops/us
> {noformat}
> The thing that is a bit frustrating is that the best throughputs are obtained 
> on a ByteBuffer that is configured to use the little endian byte order (which 
> is the native byte order of my machine) while Java/Lucene default to big 
> endian. So if we want that kind of throughput we'll probably need to add ways 
> to read data in the native byte order in the IndexInput API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to