gf2121 opened a new pull request, #14935: URL: https://github.com/apache/lucene/pull/14935
This comes from https://github.com/apache/lucene/pull/14910#issuecomment-3059176116. JMH shows:

1. `whileLoop`, `forLoop`, and `forLoopManualUnrolling` perform about evenly, with `whileLoop` even slightly ahead. So the speed-up I saw in #14910 (with the vector module disabled, or on a Mac M2) may come from somewhere else, like the consumer abstraction? JMH results can also be inconsistent with luceneutil, so it is hard to say.

```
Benchmark                                       (bitCount)   Mode  Cnt   Score   Error   Units
BitsetToArrayBenchmark.whileLoop                         5  thrpt    5  29.180 ± 0.568  ops/us
BitsetToArrayBenchmark.whileLoop                        10  thrpt    5  23.460 ± 0.120  ops/us
BitsetToArrayBenchmark.whileLoop                        20  thrpt    5  16.722 ± 0.330  ops/us
BitsetToArrayBenchmark.whileLoop                        30  thrpt    5  13.006 ± 0.292  ops/us
BitsetToArrayBenchmark.whileLoop                        40  thrpt    5  10.678 ± 0.195  ops/us
BitsetToArrayBenchmark.whileLoop                        50  thrpt    5   9.035 ± 0.179  ops/us
BitsetToArrayBenchmark.whileLoop                        60  thrpt    5   7.834 ± 0.060  ops/us
BitsetToArrayBenchmark.forLoop                           5  thrpt    5  26.207 ± 0.642  ops/us
BitsetToArrayBenchmark.forLoop                          10  thrpt    5  21.541 ± 0.281  ops/us
BitsetToArrayBenchmark.forLoop                          20  thrpt    5  15.884 ± 0.310  ops/us
BitsetToArrayBenchmark.forLoop                          30  thrpt    5  12.666 ± 0.013  ops/us
BitsetToArrayBenchmark.forLoop                          40  thrpt    5  10.412 ± 0.180  ops/us
BitsetToArrayBenchmark.forLoop                          50  thrpt    5   8.881 ± 0.091  ops/us
BitsetToArrayBenchmark.forLoop                          60  thrpt    5   7.709 ± 0.222  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling            5  thrpt    5  25.848 ± 0.492  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           10  thrpt    5  21.263 ± 0.354  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           20  thrpt    5  15.842 ± 0.315  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           30  thrpt    5  12.641 ± 0.053  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           40  thrpt    5  10.489 ± 0.073  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           50  thrpt    5   8.930 ± 0.178  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           60  thrpt    5   7.816 ± 0.018  ops/us
```

2. > By the way, we may want to look into other approaches for the scalar case.
> Since we only use bit sets in postings when many bits would be set, a linear scan should perform quite efficiently?

Inspired by this comment, I tried some dense optimizations; see the `denseXXX` variants in the JMH benchmark. The fastest is `denseBranchLessUnrolling`, but I like `denseBranchLessVectorized` better since it has less code.

```
Benchmark                                        (bitCount)   Mode  Cnt   Score   Error   Units
BitsetToArrayBenchmark.dense                              5  thrpt    5   9.668 ± 0.054  ops/us
BitsetToArrayBenchmark.dense                             10  thrpt    5   6.949 ± 0.068  ops/us
BitsetToArrayBenchmark.dense                             20  thrpt    5   4.607 ± 0.057  ops/us
BitsetToArrayBenchmark.dense                             30  thrpt    5   3.432 ± 0.037  ops/us
BitsetToArrayBenchmark.dense                             40  thrpt    5   3.759 ± 0.036  ops/us
BitsetToArrayBenchmark.dense                             50  thrpt    5   5.310 ± 0.016  ops/us
BitsetToArrayBenchmark.dense                             60  thrpt    5   9.039 ± 0.240  ops/us
BitsetToArrayBenchmark.denseBranchLess                    5  thrpt    5  13.464 ± 0.446  ops/us
BitsetToArrayBenchmark.denseBranchLess                   10  thrpt    5  13.547 ± 0.250  ops/us
BitsetToArrayBenchmark.denseBranchLess                   20  thrpt    5  13.531 ± 0.209  ops/us
BitsetToArrayBenchmark.denseBranchLess                   30  thrpt    5  13.534 ± 0.336  ops/us
BitsetToArrayBenchmark.denseBranchLess                   40  thrpt    5  13.530 ± 0.319  ops/us
BitsetToArrayBenchmark.denseBranchLess                   50  thrpt    5  13.515 ± 0.330  ops/us
BitsetToArrayBenchmark.denseBranchLess                   60  thrpt    5  13.526 ± 0.067  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling           5  thrpt    5  15.753 ± 0.262  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          10  thrpt    5  15.709 ± 0.214  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          20  thrpt    5  15.811 ± 0.334  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          30  thrpt    5  15.752 ± 0.444  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          40  thrpt    5  15.861 ± 0.074  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          50  thrpt    5  15.630 ± 0.052  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          60  thrpt    5  15.789 ± 0.682  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized          5  thrpt    5  14.884 ± 0.212  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized         10  thrpt    5  14.931 ± 0.328  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized         20  thrpt    5  14.953 ± 0.050  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized         30  thrpt    5  15.011 ± 0.328  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized         40  thrpt    5  14.961 ± 0.394  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized         50  thrpt    5  14.927 ± 0.323  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized         60  thrpt    5  14.924 ± 0.286  ops/us
BitsetToArrayBenchmark.denseInvert                        5  thrpt    5  22.539 ± 0.523  ops/us
BitsetToArrayBenchmark.denseInvert                       10  thrpt    5  16.538 ± 0.298  ops/us
BitsetToArrayBenchmark.denseInvert                       20  thrpt    5  12.110 ± 0.228  ops/us
BitsetToArrayBenchmark.denseInvert                       30  thrpt    5  10.344 ± 0.195  ops/us
BitsetToArrayBenchmark.denseInvert                       40  thrpt    5   9.934 ± 0.201  ops/us
BitsetToArrayBenchmark.denseInvert                       50  thrpt    5  10.192 ± 0.309  ops/us
BitsetToArrayBenchmark.denseInvert                       60  thrpt    5  11.114 ± 0.387  ops/us
BitsetToArrayBenchmark.hybrid                             5  thrpt    5  25.812 ± 0.503  ops/us
BitsetToArrayBenchmark.hybrid                            10  thrpt    5  21.618 ± 0.062  ops/us
BitsetToArrayBenchmark.hybrid                            20  thrpt    5  15.988 ± 0.018  ops/us
BitsetToArrayBenchmark.hybrid                            30  thrpt    5  12.660 ± 0.027  ops/us
BitsetToArrayBenchmark.hybrid                            40  thrpt    5  14.960 ± 0.201  ops/us
BitsetToArrayBenchmark.hybrid                            50  thrpt    5  14.907 ± 0.407  ops/us
BitsetToArrayBenchmark.hybrid                            60  thrpt    5  14.933 ± 0.364  ops/us
```

Thanks to this dense optimization, the luceneutil result looks even better than the AVX512 patch, so we may not need that one any more!
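For context, the core of the branchless dense idea can be sketched roughly like this (a hedged reconstruction of the technique, not the benchmark's exact code; class and method names are mine): instead of testing each bit with a branch, we unconditionally store every candidate index and advance the write pointer by the bit's value, so the loop has no data-dependent branch.

```java
public class DenseBranchlessSketch {
  /**
   * Writes the indices of set bits of {@code word} (offset by {@code base})
   * into {@code dest} starting at {@code offset}, without a per-bit branch.
   * {@code dest} needs spare room past the last kept value because of the
   * speculative writes. Returns the new write offset.
   */
  static int denseBranchless(long word, int base, int[] dest, int offset) {
    for (int i = 0; i < Long.SIZE; i++) {
      dest[offset] = base + i;             // speculative write, always happens
      offset += (int) ((word >>> i) & 1L); // pointer advances only for set bits
    }
    return offset;
  }

  public static void main(String[] args) {
    int[] dest = new int[Long.SIZE + 1];
    long word = (1L << 2) | (1L << 40) | (1L << 63);
    int count = denseBranchless(word, 0, dest, 0);
    System.out.println(count + " " + dest[0] + " " + dest[1] + " " + dest[2]);
    // prints "3 2 40 63"
  }
}
```

The work is constant (64 iterations per word) regardless of how many bits are set, which matches the flat `denseBranchLess` scores above, versus the popcount-proportional cost of the sparse loop.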
```
Task                           QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff           p-value
TermTitleSort                  56.43  (3.2%)    54.92  (3.3%)   -2.7% ( -8% -  3%)  0.030
AndHighOrMedMed                21.09  (1.7%)    20.91  (1.3%)   -0.9% ( -3% -  2%)  0.140
CountAndHighMed                63.47  (2.6%)    63.16  (3.5%)   -0.5% ( -6% -  5%)  0.672
TermDayOfYearSort             302.92  (1.2%)   301.46  (4.3%)   -0.5% ( -5% -  5%)  0.685
PKLookup                       76.25  (2.4%)    76.01  (1.4%)   -0.3% ( -3% -  3%)  0.671
Fuzzy1                         32.73  (2.4%)    32.66  (3.3%)   -0.2% ( -5% -  5%)  0.849
Fuzzy2                         30.22  (1.7%)    30.20  (2.7%)   -0.1% ( -4% -  4%)  0.929
CombinedTerm                   16.35  (2.1%)    16.34  (1.3%)   -0.0% ( -3% -  3%)  0.976
Respell                        27.75  (1.9%)    27.77  (2.7%)    0.1% ( -4% -  4%)  0.947
FilteredTerm                   64.97  (1.9%)    65.02  (2.5%)    0.1% ( -4% -  4%)  0.920
CountFilteredPhrase            11.49  (1.1%)    11.51  (1.2%)    0.1% ( -2% -  2%)  0.820
Wildcard                       48.76  (3.1%)    48.81  (3.1%)    0.1% ( -5% -  6%)  0.923
CountFilteredOrHighHigh        25.92  (0.8%)    25.96  (1.1%)    0.1% ( -1% -  2%)  0.685
Term                          396.14  (5.7%)   396.74  (5.4%)    0.2% (-10% - 11%)  0.943
IntNRQ                         55.37  (2.1%)    55.48  (2.4%)    0.2% ( -4% -  4%)  0.815
TermMonthSort                1158.39  (3.1%)  1160.79  (5.3%)    0.2% ( -7% -  8%)  0.899
CountAndHighHigh               54.59  (1.5%)    54.72  (1.9%)    0.2% ( -3% -  3%)  0.713
FilteredOrHighHigh             16.18  (1.8%)    16.23  (2.9%)    0.3% ( -4% -  5%)  0.735
IntSet                        149.94  (4.3%)   150.42  (4.7%)    0.3% ( -8% -  9%)  0.851
CountFilteredOrHighMed         31.52  (0.9%)    31.62  (1.3%)    0.3% ( -1% -  2%)  0.462
FilteredAndStopWords           12.58  (1.9%)    12.62  (1.9%)    0.4% ( -3% -  4%)  0.602
FilteredIntNRQ                 54.87  (1.1%)    55.09  (1.9%)    0.4% ( -2% -  3%)  0.503
CountFilteredIntNRQ            24.33  (1.2%)    24.44  (1.5%)    0.4% ( -2% -  3%)  0.403
FilteredAndHighHigh            14.57  (1.8%)    14.64  (2.0%)    0.5% ( -3% -  4%)  0.513
FilteredPhrase                 11.89  (2.4%)    11.95  (3.0%)    0.5% ( -4% -  6%)  0.624
Prefix3                        79.10  (3.9%)    79.51  (4.5%)    0.5% ( -7% -  9%)  0.743
FilteredOrHighMed              49.42  (1.8%)    49.69  (3.2%)    0.5% ( -4% -  5%)  0.586
CombinedAndHighMed             28.15  (1.8%)    28.31  (2.7%)    0.6% ( -3% -  5%)  0.514
FilteredOr2Terms2StopWords     57.94  (1.9%)    58.27  (2.8%)    0.6% ( -3% -  5%)  0.520
FilteredPrefix3                73.30  (4.3%)    73.76  (4.8%)    0.6% ( -8% - 10%)  0.720
Phrase                          9.22  (2.2%)     9.28  (2.9%)    0.6% ( -4% -  5%)  0.517
CountOrHighMed                 83.12  (2.1%)    83.68  (2.4%)    0.7% ( -3% -  5%)  0.431
CountOrHighHigh                55.51  (0.9%)    55.89  (1.7%)    0.7% ( -1% -  3%)  0.173
FilteredOrStopWords            11.10  (2.1%)    11.18  (2.7%)    0.7% ( -3% -  5%)  0.426
FilteredAnd3Terms              80.17  (1.4%)    80.77  (2.9%)    0.7% ( -3% -  5%)  0.383
AndMedOrHighHigh               19.29  (3.1%)    19.43  (3.5%)    0.8% ( -5% -  7%)  0.545
CountTerm                    2854.29  (2.9%)  2876.79  (3.9%)    0.8% ( -5% -  7%)  0.545
DismaxTerm                    321.50  (4.0%)   324.09  (3.6%)    0.8% ( -6% -  8%)  0.575
FilteredOr3Terms               50.89  (1.3%)    51.31  (2.9%)    0.8% ( -3% -  5%)  0.327
CombinedOrHighMed              27.67  (1.7%)    27.93  (2.4%)    0.9% ( -3% -  5%)  0.238
OrHighMed                      78.17  (8.1%)    78.97  (8.1%)    1.0% (-13% - 18%)  0.737
DismaxOrHighMed                52.90  (4.7%)    53.87  (5.0%)    1.8% ( -7% - 12%)  0.317
TermDTSort                    184.12  (1.8%)   187.60  (4.0%)    1.9% ( -3% -  7%)  0.109
FilteredAndHighMed             42.42  (2.3%)    43.24  (3.4%)    1.9% ( -3% -  7%)  0.084
AndHighMed                     59.49  (6.4%)    60.73  (8.2%)    2.1% (-11% - 17%)  0.452
OrHighRare                    113.84  (5.1%)   116.29  (5.9%)    2.1% ( -8% - 13%)  0.303
And2Terms2StopWords            60.52  (4.4%)    62.04  (4.9%)    2.5% ( -6% - 12%)  0.154
DismaxOrHighHigh               37.38  (5.3%)    38.37  (5.7%)    2.7% ( -7% - 14%)  0.201
FilteredAnd2Terms2StopWords    61.89  (2.5%)    63.59  (3.8%)    2.7% ( -3% -  9%)  0.025
CombinedAndHighHigh             7.77  (1.9%)     7.99  (2.0%)    2.8% ( -1% -  6%)  0.000
CombinedOrHighHigh              7.66  (1.9%)     7.90  (2.3%)    3.1% ( -1% -  7%)  0.000
Or2Terms2StopWords             62.35  (4.3%)    64.34  (5.4%)    3.2% ( -6% - 13%)  0.086
AndHighHigh                    26.50  (9.3%)    28.10 (10.7%)    6.0% (-12% - 28%)  0.112
OrHighHigh                     26.64  (8.6%)    28.27 (10.4%)    6.1% (-11% - 27%)  0.089
Or3Terms                       64.65  (4.8%)    68.88  (6.5%)    6.5% ( -4% - 18%)  0.002
And3Terms                      69.90  (4.6%)    74.59  (6.7%)    6.7% ( -4% - 18%)  0.002
AndStopWords                   10.30  (5.8%)    11.89  (9.2%)   15.4% (  0% - 32%)  0.000
OrStopWords                    10.83  (7.6%)    12.57 (10.1%)   16.0% ( -1% - 36%)  0.000
```

Another thing I am not happy about is the word-level code duplicated with `FixedBitSet#forEach`, but I have not had a good idea for reusing it so far.
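For reference, the word-level pattern that ends up duplicated with `FixedBitSet#forEach` is essentially the classic lowest-set-bit loop (a sketch under my own naming, not the exact Lucene code): peel off the lowest set bit of each word with `numberOfTrailingZeros` until the word is exhausted.

```java
public class WordLoopSketch {
  /**
   * Writes the indices of set bits of {@code word} (offset by {@code base},
   * i.e. 64 * wordIndex) into {@code dest} starting at {@code offset};
   * returns the new write offset.
   */
  static int whileLoop(long word, int base, int[] dest, int offset) {
    while (word != 0L) {
      dest[offset++] = base + Long.numberOfTrailingZeros(word);
      word &= word - 1; // clear the lowest set bit
    }
    return offset;
  }

  public static void main(String[] args) {
    int[] dest = new int[8];
    // bits 0, 1 and 3 of the second 64-bit word
    int count = whileLoop(0b1011L, 64, dest, 0);
    System.out.println(count + " " + dest[0] + " " + dest[1] + " " + dest[2]);
    // prints "3 64 65 67"
  }
}
```

The duplication is awkward because `forEach` hands each index to a consumer while this patch writes into an array; the loop body differs only in that one line.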
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org