[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973328#comment-16973328 ]
Adrien Grand commented on LUCENE-9027:
--------------------------------------

I guess it depends on what kind of analytics. I expect counts to get the best speedup indeed, since they are only about decoding postings and counting matches. Other analytics use cases like facets still need to read values from a doc-values field, which likely drowns out the speedup a bit, as sorting does. We have some benchmarks for Elasticsearch that run nightly at https://benchmarks.elastic.co, but most analytics queries like histograms or terms facets run on a match_all query, so I'm not expecting to see any speedup there.

I've done some more research regarding the prefix sum. Unfortunately, I didn't manage to get C2 to generate SIMD instructions for the prefix sum (even with gaps of 2 or 4 instead of 1). The best option I have for now works by hooking into the decoding logic and summing up longs while they still represent 2 packed integers, which effectively computes the prefix sums of [0:64) and [64:128) in parallel, and then adding the value at index 63 to all values at indices [64:128). This last step is vectorized by the JVM. See the benchmark results at the bottom of https://github.com/jpountz/decode-128-ints-benchmark if you are interested. I wish we could do better, but it's still nice that, on average, we can both decompress and compute the prefix sum faster than we can just decompress with the current codec. Having the sum pre-computed also helps simplify some conditions in the PostingsEnum implementations of nextDoc/advance.

Here is the result of a run on wikibigall with the current pull request.
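The two-lane prefix sum described above can be sketched as follows. This is an illustrative sketch, not the code from the pull request: `PrefixSum128` is a hypothetical name, and it assumes each long packs a value from [0:64) in its upper 32 bits and the corresponding value from [64:128) in its lower 32 bits, with per-lane sums small enough that no carry ever crosses the lane boundary.

```java
// Illustrative sketch: prefix sum of 128 deltas stored as 64 packed longs.
// Long i holds delta[i] in its upper 32 bits and delta[64 + i] in its
// lower 32 bits (hypothetical layout for this example).
public class PrefixSum128 {
  public static void prefixSum(long[] packed, long[] out) {
    // Summing the packed longs computes both lanes' prefix sums at once:
    // the upper lane accumulates [0:64), the lower lane [64:128).
    for (int i = 1; i < 64; i++) {
      packed[i] += packed[i - 1];
    }
    // Unpack the two lanes.
    for (int i = 0; i < 64; i++) {
      out[i] = packed[i] >>> 32;             // prefix sums of [0:64)
      out[64 + i] = packed[i] & 0xFFFFFFFFL; // partial sums of [64:128)
    }
    // Add the running total at index 63 to all values in [64:128); this
    // simple loop shape is the kind that C2 can auto-vectorize.
    long base = out[63];
    for (int i = 64; i < 128; i++) {
      out[i] += base;
    }
  }
}
```

The packed summation does 64 additions for 128 values; only the final fix-up loop touches the second half again.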
I included some sorting tasks as well to see how much they benefit from this change:

{noformat}
                 Task  QPS baseline  StdDev  QPS patch  StdDev  Pct diff
               Fuzzy2         99.40 (10.4%)      95.46  (6.9%)     -4.0% ( -19% -  14%)
               IntNRQ        966.00  (2.4%)     977.27  (2.1%)      1.2% (  -3% -   5%)
               Fuzzy1         97.25  (7.8%)      99.66  (9.4%)      2.5% ( -13% -  21%)
           OrHighHigh         84.95  (3.7%)      87.40  (4.2%)      2.9% (  -4% -  11%)
                 Term       1316.61  (3.1%)    1357.55  (4.8%)      3.1% (  -4% -  11%)
HighTermDayOfYearSort         35.89  (6.3%)      37.66  (4.2%)      4.9% (  -5% -  16%)
               Phrase         60.22  (2.1%)      63.45  (4.3%)      5.4% (  -1% -  12%)
    HighTermMonthSort         63.83  (9.1%)      67.40 (10.7%)      5.6% ( -12% -  27%)
          AndHighHigh         27.27  (3.4%)      28.80  (4.0%)      5.6% (  -1% -  13%)
         SloppyPhrase          1.35  (7.4%)       1.43  (9.3%)      6.2% (  -9% -  24%)
      AndHighOrMedMed         26.38  (1.7%)      28.17  (2.6%)      6.8% (   2% -  11%)
     IntervalsOrdered         21.99  (2.8%)      23.57  (1.9%)      7.2% (   2% -  12%)
            OrHighMed         39.27  (2.8%)      42.32  (3.0%)      7.8% (   1% -  13%)
             SpanNear          9.74  (3.2%)      10.71  (2.0%)     10.0% (   4% -  15%)
           AndHighMed         59.40  (3.1%)      65.44  (3.6%)     10.2% (   3% -  17%)
             Wildcard        127.70  (7.7%)     140.93  (3.3%)     10.4% (   0% -  23%)
     AndMedOrHighHigh         30.65  (1.4%)      34.15  (2.1%)     11.4% (   7% -  15%)
              Prefix3         46.12  (9.9%)      53.01 (10.1%)     14.9% (  -4% -  38%)
{noformat}

I now plan to focus on getting this into a mergeable state.

> SIMD-based decoding of postings lists
> -------------------------------------
>
>                 Key: LUCENE-9027
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9027
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be
> able to write the decoding logic in such a way that Java would use SIMD
> instructions.
> More recently [~paul.masurel] wrote a
> [blog post|https://fulmicoton.com/posts/bitpacking/] that raises the point
> that Lucene could still decode multiple ints at once in a single instruction
> by packing two ints in a long, and we've had some discussions about what we
> could try in Lucene to speed up the decoding of postings. This made me want
> to look a bit deeper at what we could do.
>
> Our current decoding logic reads data into a byte[] and decodes packed
> integers from it. Unfortunately it doesn't make use of SIMD instructions and
> looks like
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of
> integers and shift them all by the same number of bits, then Java will use
> SIMD instructions to shift multiple integers at once. This led me to write
> this
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
> that tries as much as possible to shift long sequences of ints by the same
> number of bits to speed up decoding. It is indeed faster than the current
> logic we have, up to about 2x faster for some numbers of bits per value.
>
> Currently the best
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
> I've been able to come up with combines the above idea with the idea that
> Paul mentioned in his blog that consists of emulating SIMD from Java by
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a
> bit harder to read but gives another speedup on top of the above
> implementation.
>
> I have a
> [JMH benchmark|https://github.com/jpountz/decode-128-ints-benchmark/]
> available in case someone would like to play with this and maybe even come
> up with an even faster implementation.
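To make the packing idea concrete, here is a minimal sketch for the bitsPerValue=4 case. `Emulated4BitDecoder` is a hypothetical name, not the SIMDDecoder from the linked benchmark (which handles all bit widths), and it assumes a layout where each byte stores the earlier of its two values in the high nibble and bytes are ordered low-to-high within the long. Two shift-and-mask operations split all 16 nibbles of a long at once, instead of 16 individual shift/mask pairs.

```java
// Illustrative sketch of emulating SIMD by packing values into a long:
// each long holds sixteen 4-bit values (two per byte, assumed layout).
public class Emulated4BitDecoder {
  private static final long MASK4 = 0x0F0F0F0F0F0F0F0FL;

  // Decode each input long into sixteen ints appended to out.
  public static void decode(long[] in, int[] out) {
    int o = 0;
    for (long word : in) {
      // One shift and two masks extract all 16 nibbles: this is the
      // "8 bytes in a long" flavor of the emulated-SIMD trick.
      long hi = (word >>> 4) & MASK4; // high nibble of each byte
      long lo = word & MASK4;         // low nibble of each byte
      for (int b = 0; b < 8; b++) {
        int shift = 8 * b;
        out[o++] = (int) ((hi >>> shift) & 0xF);
        out[o++] = (int) ((lo >>> shift) & 0xF);
      }
    }
  }
}
```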
> It is 2-2.5x faster than our current implementation for most numbers of bits
> per value. I'm copying results here:
>
> {noformat}
> * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves
>   as a baseline.
> * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>   current Lucene codec does.
> * `decodeNaiveFromLongs` decodes from longs on the fly.
> * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>   recognizes some patterns and uses SIMD instructions.
> * `decodeSIMD` is a more complex implementation that both relies on the C2
>   compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>   2 ints in a long in order to decompress multiple values at once.
>
> Benchmark                                       (bitsPerValue)  (byteOrder)  Mode  Cnt   Score   Error  Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               1           LE  thrpt   5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               1           BE  thrpt   5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               2           LE  thrpt   5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               2           BE  thrpt   5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               3           LE  thrpt   5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               3           BE  thrpt   5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               4           LE  thrpt   5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               4           BE  thrpt   5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               5           LE  thrpt   5  10.988 ± 1.179  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               5           BE  thrpt   5  11.226 ± 0.088  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               6           LE  thrpt   5   9.791 ± 0.305  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               6           BE  thrpt   5   9.403 ± 3.598  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               7           LE  thrpt   5  10.256 ± 0.211  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               7           BE  thrpt   5  10.314 ± 0.382  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               8           LE  thrpt   5  16.516 ± 0.380  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               8           BE  thrpt   5  16.375 ± 0.427  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               9           LE  thrpt   5   9.067 ± 0.066  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes               9           BE  thrpt   5   9.078 ± 0.178  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              10           LE  thrpt   5   8.913 ± 0.074  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              10           BE  thrpt   5   8.893 ± 0.101  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              11           LE  thrpt   5   7.908 ± 0.118  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              11           BE  thrpt   5   7.864 ± 0.097  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              12           LE  thrpt   5   9.220 ± 0.103  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              12           BE  thrpt   5   9.186 ± 0.241  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              13           LE  thrpt   5   7.119 ± 0.071  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              13           BE  thrpt   5   7.066 ± 0.059  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              14           LE  thrpt   5  12.483 ± 0.171  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              14           BE  thrpt   5  12.473 ± 0.117  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              15           LE  thrpt   5   6.202 ± 0.192  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              15           BE  thrpt   5   6.187 ± 0.399  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              16           LE  thrpt   5  12.798 ± 0.249  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes              16           BE  thrpt   5  12.987 ± 0.208  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               1           LE  thrpt   5   7.248 ± 0.096  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               1           BE  thrpt   5   7.292 ± 0.114  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               2           LE  thrpt   5   8.923 ± 0.099  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               2           BE  thrpt   5   8.899 ± 0.028  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               3           LE  thrpt   5   9.192 ± 0.082  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               3           BE  thrpt   5   9.090 ± 0.066  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               4           LE  thrpt   5   7.947 ± 0.039  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               4           BE  thrpt   5   7.809 ± 0.298  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               5           LE  thrpt   5   8.342 ± 0.568  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               5           BE  thrpt   5   8.259 ± 0.572  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               6           LE  thrpt   5  15.594 ± 0.149  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               6           BE  thrpt   5  14.012 ± 0.160  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               7           LE  thrpt   5  12.686 ± 0.271  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               7           BE  thrpt   5  12.806 ± 0.160  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               8           LE  thrpt   5  13.571 ± 0.135  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               8           BE  thrpt   5  13.312 ± 0.110  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               9           LE  thrpt   5  11.812 ± 0.108  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs               9           BE  thrpt   5  12.874 ± 0.168  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              10           LE  thrpt   5  12.882 ± 0.114  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              10           BE  thrpt   5  12.142 ± 0.091  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              11           LE  thrpt   5  12.302 ± 0.111  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              11           BE  thrpt   5  10.762 ± 0.250  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              12           LE  thrpt   5  12.505 ± 0.070  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              12           BE  thrpt   5  12.149 ± 0.083  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              13           LE  thrpt   5  11.159 ± 0.341  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              13           BE  thrpt   5  10.395 ± 0.222  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              14           LE  thrpt   5  11.004 ± 0.058  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              14           BE  thrpt   5  10.312 ± 0.369  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              15           LE  thrpt   5  11.236 ± 0.117  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              15           BE  thrpt   5   9.792 ± 0.202  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              16           LE  thrpt   5  10.607 ± 0.105  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromLongs              16           BE  thrpt   5  10.340 ± 0.070  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         1           LE  thrpt   5  20.925 ± 0.368  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         1           BE  thrpt   5  13.396 ± 0.485  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         2           LE  thrpt   5  20.628 ± 0.494  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         2           BE  thrpt   5  13.584 ± 0.194  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         3           LE  thrpt   5  19.932 ± 1.609  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         3           BE  thrpt   5  13.296 ± 0.095  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         4           LE  thrpt   5  21.065 ± 0.767  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         4           BE  thrpt   5  13.557 ± 0.051  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         5           LE  thrpt   5  19.630 ± 0.067  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         5           BE  thrpt   5  12.916 ± 0.186  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         6           LE  thrpt   5  20.253 ± 0.701  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         6           BE  thrpt   5  12.820 ± 0.048  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         7           LE  thrpt   5  18.944 ± 0.160  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         7           BE  thrpt   5  12.562 ± 0.128  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         8           LE  thrpt   5  22.778 ± 2.023  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         8           BE  thrpt   5  13.658 ± 0.158  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         9           LE  thrpt   5  18.527 ± 0.169  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                         9           BE  thrpt   5  12.045 ± 0.111  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        10           LE  thrpt   5  16.610 ± 0.997  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        10           BE  thrpt   5  11.208 ± 0.087  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        11           LE  thrpt   5  17.961 ± 0.080  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        11           BE  thrpt   5  11.594 ± 0.084  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        12           LE  thrpt   5  16.980 ± 2.372  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        12           BE  thrpt   5  11.135 ± 0.050  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        13           LE  thrpt   5  17.592 ± 0.269  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        13           BE  thrpt   5  11.132 ± 0.227  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        14           LE  thrpt   5  16.964 ± 0.423  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        14           BE  thrpt   5  10.953 ± 0.326  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        15           LE  thrpt   5  17.972 ± 0.572  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        15           BE  thrpt   5  10.872 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        16           LE  thrpt   5  24.152 ± 0.213  ops/us
> PackedIntsDecodeBenchmark.decodeSIMD                        16           BE  thrpt   5  12.984 ± 0.348  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   1           LE  thrpt   5  14.567 ± 0.714  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   1           BE  thrpt   5  10.541 ± 0.079  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   2           LE  thrpt   5  15.395 ± 0.687  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   2           BE  thrpt   5  11.142 ± 0.052  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   3           LE  thrpt   5  15.802 ± 0.623  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   3           BE  thrpt   5  10.656 ± 0.278  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   4           LE  thrpt   5  17.732 ± 0.276  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   4           BE  thrpt   5  11.289 ± 0.209  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   5           LE  thrpt   5  16.230 ± 0.389  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   5           BE  thrpt   5  10.216 ± 0.184  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   6           LE  thrpt   5  16.478 ± 0.682  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   6           BE  thrpt   5  10.379 ± 0.157  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   8           LE  thrpt   5  18.222 ± 0.388  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                   8           BE  thrpt   5  11.153 ± 0.619  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                  10           LE  thrpt   5  15.138 ± 0.321  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                  10           BE  thrpt   5   9.384 ± 0.671  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                  16           LE  thrpt   5  20.776 ± 0.397  ops/us
> PackedIntsDecodeBenchmark.decodeSimpleSIMD                  16           BE  thrpt   5  10.199 ± 0.146  ops/us
> PackedIntsDecodeBenchmark.readLongs                          1           LE  thrpt   5  30.220 ± 0.652  ops/us
> PackedIntsDecodeBenchmark.readLongs                          1           BE  thrpt   5  16.324 ± 0.226  ops/us
> PackedIntsDecodeBenchmark.readLongs                          2           LE  thrpt   5  30.952 ± 0.329  ops/us
> PackedIntsDecodeBenchmark.readLongs                          2           BE  thrpt   5  16.492 ± 0.397  ops/us
> PackedIntsDecodeBenchmark.readLongs                          3           LE  thrpt   5  30.156 ± 0.979  ops/us
> PackedIntsDecodeBenchmark.readLongs                          3           BE  thrpt   5  16.273 ± 0.441  ops/us
> PackedIntsDecodeBenchmark.readLongs                          4           LE  thrpt   5  29.925 ± 0.718  ops/us
> PackedIntsDecodeBenchmark.readLongs                          4           BE  thrpt   5  15.930 ± 0.350  ops/us
> PackedIntsDecodeBenchmark.readLongs                          5           LE  thrpt   5  29.773 ± 0.979  ops/us
> PackedIntsDecodeBenchmark.readLongs                          5           BE  thrpt   5  15.775 ± 0.257  ops/us
> PackedIntsDecodeBenchmark.readLongs                          6           LE  thrpt   5  29.591 ± 1.285  ops/us
> PackedIntsDecodeBenchmark.readLongs                          6           BE  thrpt   5  15.732 ± 0.226  ops/us
> PackedIntsDecodeBenchmark.readLongs                          7           LE  thrpt   5  29.708 ± 0.909  ops/us
> PackedIntsDecodeBenchmark.readLongs                          7           BE  thrpt   5  15.433 ± 0.562  ops/us
> PackedIntsDecodeBenchmark.readLongs                          8           LE  thrpt   5  29.828 ± 0.689  ops/us
> PackedIntsDecodeBenchmark.readLongs                          8           BE  thrpt   5  15.390 ± 0.188  ops/us
> PackedIntsDecodeBenchmark.readLongs                          9           LE  thrpt   5  29.127 ± 0.309  ops/us
> PackedIntsDecodeBenchmark.readLongs                          9           BE  thrpt   5  15.180 ± 0.199  ops/us
> PackedIntsDecodeBenchmark.readLongs                         10           LE  thrpt   5  29.085 ± 0.538  ops/us
> PackedIntsDecodeBenchmark.readLongs                         10           BE  thrpt   5  14.887 ± 1.687  ops/us
> PackedIntsDecodeBenchmark.readLongs                         11           LE  thrpt   5  28.904 ± 0.329  ops/us
> PackedIntsDecodeBenchmark.readLongs                         11           BE  thrpt   5  14.936 ± 0.119  ops/us
> PackedIntsDecodeBenchmark.readLongs                         12           LE  thrpt   5  29.025 ± 0.299  ops/us
> PackedIntsDecodeBenchmark.readLongs                         12           BE  thrpt   5  14.685 ± 0.154  ops/us
> PackedIntsDecodeBenchmark.readLongs                         13           LE  thrpt   5  28.963 ± 0.244  ops/us
> PackedIntsDecodeBenchmark.readLongs                         13           BE  thrpt   5  14.569 ± 0.100  ops/us
> PackedIntsDecodeBenchmark.readLongs                         14           LE  thrpt   5  28.584 ± 1.409  ops/us
> PackedIntsDecodeBenchmark.readLongs                         14           BE  thrpt   5  14.340 ± 0.594  ops/us
> PackedIntsDecodeBenchmark.readLongs                         15           LE  thrpt   5  28.744 ± 0.314  ops/us
> PackedIntsDecodeBenchmark.readLongs                         15           BE  thrpt   5  14.222 ± 0.105  ops/us
> PackedIntsDecodeBenchmark.readLongs                         16           LE  thrpt   5  26.638 ± 0.452  ops/us
> PackedIntsDecodeBenchmark.readLongs                         16           BE  thrpt   5  13.906 ± 0.604  ops/us
> {noformat}
>
> The thing that is a bit frustrating is that the best throughputs are obtained
> on a ByteBuffer that is configured to use the little-endian byte order (which
> is the native byte order of my machine) while Java/Lucene default to big
> endian. So if we want that kind of throughput we'll probably need to add ways
> to read data in the native byte order in the IndexInput API.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org