gf2121 opened a new pull request, #14935: URL: https://github.com/apache/lucene/pull/14935
This comes from https://github.com/apache/lucene/pull/14910#issuecomment-3059176116. JMH shows:

1. `whileLoop`, `forLoop`, and `forLoopManualUnrolling` perform about evenly, with `whileLoop` even slightly ahead. So the speed-up I saw in #14910 (with the vector module disabled, or on a Mac M2) may come from somewhere else, like the consumer abstraction? JMH results can also be inconsistent with luceneutil, so it is hard to say.

```
Benchmark                                       (bitCount)   Mode  Cnt   Score   Error   Units
BitsetToArrayBenchmark.whileLoop                         5  thrpt    5  29.180 ± 0.568  ops/us
BitsetToArrayBenchmark.whileLoop                        10  thrpt    5  23.460 ± 0.120  ops/us
BitsetToArrayBenchmark.whileLoop                        20  thrpt    5  16.722 ± 0.330  ops/us
BitsetToArrayBenchmark.whileLoop                        30  thrpt    5  13.006 ± 0.292  ops/us
BitsetToArrayBenchmark.whileLoop                        40  thrpt    5  10.678 ± 0.195  ops/us
BitsetToArrayBenchmark.whileLoop                        50  thrpt    5   9.035 ± 0.179  ops/us
BitsetToArrayBenchmark.whileLoop                        60  thrpt    5   7.834 ± 0.060  ops/us
BitsetToArrayBenchmark.forLoop                           5  thrpt    5  26.207 ± 0.642  ops/us
BitsetToArrayBenchmark.forLoop                          10  thrpt    5  21.541 ± 0.281  ops/us
BitsetToArrayBenchmark.forLoop                          20  thrpt    5  15.884 ± 0.310  ops/us
BitsetToArrayBenchmark.forLoop                          30  thrpt    5  12.666 ± 0.013  ops/us
BitsetToArrayBenchmark.forLoop                          40  thrpt    5  10.412 ± 0.180  ops/us
BitsetToArrayBenchmark.forLoop                          50  thrpt    5   8.881 ± 0.091  ops/us
BitsetToArrayBenchmark.forLoop                          60  thrpt    5   7.709 ± 0.222  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling            5  thrpt    5  25.848 ± 0.492  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           10  thrpt    5  21.263 ± 0.354  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           20  thrpt    5  15.842 ± 0.315  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           30  thrpt    5  12.641 ± 0.053  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           40  thrpt    5  10.489 ± 0.073  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           50  thrpt    5   8.930 ± 0.178  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           60  thrpt    5   7.816 ± 0.018  ops/us
```

2. > By the way, we may want to look into other approaches for the scalar case.
> Since we only use bit sets in postings when many bits would be set, a linear scan should perform quite efficiently?

Inspired by this comment, I tried some dense optimizations; see the `denseXXX` variants in the JMH benchmark. The fastest is `denseBranchLessUnrolling`, but I like `denseBranchLessVectorized` better since it has less code.

```
Benchmark                                        (bitCount)   Mode  Cnt   Score   Error   Units
BitsetToArrayBenchmark.dense                              5  thrpt    5   9.668 ± 0.054  ops/us
BitsetToArrayBenchmark.dense                             10  thrpt    5   6.949 ± 0.068  ops/us
BitsetToArrayBenchmark.dense                             20  thrpt    5   4.607 ± 0.057  ops/us
BitsetToArrayBenchmark.dense                             30  thrpt    5   3.432 ± 0.037  ops/us
BitsetToArrayBenchmark.dense                             40  thrpt    5   3.759 ± 0.036  ops/us
BitsetToArrayBenchmark.dense                             50  thrpt    5   5.310 ± 0.016  ops/us
BitsetToArrayBenchmark.dense                             60  thrpt    5   9.039 ± 0.240  ops/us
BitsetToArrayBenchmark.denseBranchLess                    5  thrpt    5  13.464 ± 0.446  ops/us
BitsetToArrayBenchmark.denseBranchLess                   10  thrpt    5  13.547 ± 0.250  ops/us
BitsetToArrayBenchmark.denseBranchLess                   20  thrpt    5  13.531 ± 0.209  ops/us
BitsetToArrayBenchmark.denseBranchLess                   30  thrpt    5  13.534 ± 0.336  ops/us
BitsetToArrayBenchmark.denseBranchLess                   40  thrpt    5  13.530 ± 0.319  ops/us
BitsetToArrayBenchmark.denseBranchLess                   50  thrpt    5  13.515 ± 0.330  ops/us
BitsetToArrayBenchmark.denseBranchLess                   60  thrpt    5  13.526 ± 0.067  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling           5  thrpt    5  15.753 ± 0.262  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          10  thrpt    5  15.709 ± 0.214  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          20  thrpt    5  15.811 ± 0.334  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          30  thrpt    5  15.752 ± 0.444  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          40  thrpt    5  15.861 ± 0.074  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          50  thrpt    5  15.630 ± 0.052  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          60  thrpt    5  15.789 ± 0.682  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized          5  thrpt    5  14.884 ± 0.212  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized         10  thrpt    5  14.931 ± 0.328  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized         20  thrpt    5  14.953 ± 0.050  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized         30  thrpt    5  15.011 ± 0.328  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized         40  thrpt    5  14.961 ± 0.394  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized         50  thrpt    5  14.927 ± 0.323  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized         60  thrpt    5  14.924 ± 0.286  ops/us
BitsetToArrayBenchmark.denseInvert                        5  thrpt    5  22.539 ± 0.523  ops/us
BitsetToArrayBenchmark.denseInvert                       10  thrpt    5  16.538 ± 0.298  ops/us
BitsetToArrayBenchmark.denseInvert                       20  thrpt    5  12.110 ± 0.228  ops/us
BitsetToArrayBenchmark.denseInvert                       30  thrpt    5  10.344 ± 0.195  ops/us
BitsetToArrayBenchmark.denseInvert                       40  thrpt    5   9.934 ± 0.201  ops/us
BitsetToArrayBenchmark.denseInvert                       50  thrpt    5  10.192 ± 0.309  ops/us
BitsetToArrayBenchmark.denseInvert                       60  thrpt    5  11.114 ± 0.387  ops/us
BitsetToArrayBenchmark.hybrid                             5  thrpt    5  25.812 ± 0.503  ops/us
BitsetToArrayBenchmark.hybrid                            10  thrpt    5  21.618 ± 0.062  ops/us
BitsetToArrayBenchmark.hybrid                            20  thrpt    5  15.988 ± 0.018  ops/us
BitsetToArrayBenchmark.hybrid                            30  thrpt    5  12.660 ± 0.027  ops/us
BitsetToArrayBenchmark.hybrid                            40  thrpt    5  14.960 ± 0.201  ops/us
BitsetToArrayBenchmark.hybrid                            50  thrpt    5  14.907 ± 0.407  ops/us
BitsetToArrayBenchmark.hybrid                            60  thrpt    5  14.933 ± 0.364  ops/us
```

Thanks to this dense optimization, the luceneutil result looks even better than the AVX512 patch, so we may not need that one any more!
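For context, the core of the branchless dense idea can be sketched roughly like this (a hedged reconstruction of the technique, not the benchmark's exact code; class and method names are mine): instead of testing each bit with a branch, we unconditionally store every candidate index and advance the write pointer by the bit's value, so the loop has no data-dependent branch.

```java
public class DenseBranchlessSketch {
  /**
   * Writes the indices of set bits of {@code word} (offset by {@code base})
   * into {@code dest} starting at {@code offset}, without a per-bit branch.
   * {@code dest} needs spare room past the last kept value because of the
   * speculative writes. Returns the new write offset.
   */
  static int denseBranchless(long word, int base, int[] dest, int offset) {
    for (int i = 0; i < Long.SIZE; i++) {
      dest[offset] = base + i;             // speculative write, always happens
      offset += (int) ((word >>> i) & 1L); // pointer advances only for set bits
    }
    return offset;
  }

  public static void main(String[] args) {
    int[] dest = new int[Long.SIZE + 1];
    long word = (1L << 2) | (1L << 40) | (1L << 63);
    int count = denseBranchless(word, 0, dest, 0);
    System.out.println(count + " " + dest[0] + " " + dest[1] + " " + dest[2]);
    // prints "3 2 40 63"
  }
}
```

The work is constant (64 iterations per word) regardless of how many bits are set, which matches the flat `denseBranchLess` scores above, versus the popcount-proportional cost of the sparse loop.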
```
Task                           QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff           p-value
TermTitleSort                  56.43  (3.2%)    54.92  (3.3%)   -2.7% ( -8% -  3%)  0.030
AndHighOrMedMed                21.09  (1.7%)    20.91  (1.3%)   -0.9% ( -3% -  2%)  0.140
CountAndHighMed                63.47  (2.6%)    63.16  (3.5%)   -0.5% ( -6% -  5%)  0.672
TermDayOfYearSort             302.92  (1.2%)   301.46  (4.3%)   -0.5% ( -5% -  5%)  0.685
PKLookup                       76.25  (2.4%)    76.01  (1.4%)   -0.3% ( -3% -  3%)  0.671
Fuzzy1                         32.73  (2.4%)    32.66  (3.3%)   -0.2% ( -5% -  5%)  0.849
Fuzzy2                         30.22  (1.7%)    30.20  (2.7%)   -0.1% ( -4% -  4%)  0.929
CombinedTerm                   16.35  (2.1%)    16.34  (1.3%)   -0.0% ( -3% -  3%)  0.976
Respell                        27.75  (1.9%)    27.77  (2.7%)    0.1% ( -4% -  4%)  0.947
FilteredTerm                   64.97  (1.9%)    65.02  (2.5%)    0.1% ( -4% -  4%)  0.920
CountFilteredPhrase            11.49  (1.1%)    11.51  (1.2%)    0.1% ( -2% -  2%)  0.820
Wildcard                       48.76  (3.1%)    48.81  (3.1%)    0.1% ( -5% -  6%)  0.923
CountFilteredOrHighHigh        25.92  (0.8%)    25.96  (1.1%)    0.1% ( -1% -  2%)  0.685
Term                          396.14  (5.7%)   396.74  (5.4%)    0.2% (-10% - 11%)  0.943
IntNRQ                         55.37  (2.1%)    55.48  (2.4%)    0.2% ( -4% -  4%)  0.815
TermMonthSort                1158.39  (3.1%)  1160.79  (5.3%)    0.2% ( -7% -  8%)  0.899
CountAndHighHigh               54.59  (1.5%)    54.72  (1.9%)    0.2% ( -3% -  3%)  0.713
FilteredOrHighHigh             16.18  (1.8%)    16.23  (2.9%)    0.3% ( -4% -  5%)  0.735
IntSet                        149.94  (4.3%)   150.42  (4.7%)    0.3% ( -8% -  9%)  0.851
CountFilteredOrHighMed         31.52  (0.9%)    31.62  (1.3%)    0.3% ( -1% -  2%)  0.462
FilteredAndStopWords           12.58  (1.9%)    12.62  (1.9%)    0.4% ( -3% -  4%)  0.602
FilteredIntNRQ                 54.87  (1.1%)    55.09  (1.9%)    0.4% ( -2% -  3%)  0.503
CountFilteredIntNRQ            24.33  (1.2%)    24.44  (1.5%)    0.4% ( -2% -  3%)  0.403
FilteredAndHighHigh            14.57  (1.8%)    14.64  (2.0%)    0.5% ( -3% -  4%)  0.513
FilteredPhrase                 11.89  (2.4%)    11.95  (3.0%)    0.5% ( -4% -  6%)  0.624
Prefix3                        79.10  (3.9%)    79.51  (4.5%)    0.5% ( -7% -  9%)  0.743
FilteredOrHighMed              49.42  (1.8%)    49.69  (3.2%)    0.5% ( -4% -  5%)  0.586
CombinedAndHighMed             28.15  (1.8%)    28.31  (2.7%)    0.6% ( -3% -  5%)  0.514
FilteredOr2Terms2StopWords     57.94  (1.9%)    58.27  (2.8%)    0.6% ( -3% -  5%)  0.520
FilteredPrefix3                73.30  (4.3%)    73.76  (4.8%)    0.6% ( -8% - 10%)  0.720
Phrase                          9.22  (2.2%)     9.28  (2.9%)    0.6% ( -4% -  5%)  0.517
CountOrHighMed                 83.12  (2.1%)    83.68  (2.4%)    0.7% ( -3% -  5%)  0.431
CountOrHighHigh                55.51  (0.9%)    55.89  (1.7%)    0.7% ( -1% -  3%)  0.173
FilteredOrStopWords            11.10  (2.1%)    11.18  (2.7%)    0.7% ( -3% -  5%)  0.426
FilteredAnd3Terms              80.17  (1.4%)    80.77  (2.9%)    0.7% ( -3% -  5%)  0.383
AndMedOrHighHigh               19.29  (3.1%)    19.43  (3.5%)    0.8% ( -5% -  7%)  0.545
CountTerm                    2854.29  (2.9%)  2876.79  (3.9%)    0.8% ( -5% -  7%)  0.545
DismaxTerm                    321.50  (4.0%)   324.09  (3.6%)    0.8% ( -6% -  8%)  0.575
FilteredOr3Terms               50.89  (1.3%)    51.31  (2.9%)    0.8% ( -3% -  5%)  0.327
CombinedOrHighMed              27.67  (1.7%)    27.93  (2.4%)    0.9% ( -3% -  5%)  0.238
OrHighMed                      78.17  (8.1%)    78.97  (8.1%)    1.0% (-13% - 18%)  0.737
DismaxOrHighMed                52.90  (4.7%)    53.87  (5.0%)    1.8% ( -7% - 12%)  0.317
TermDTSort                    184.12  (1.8%)   187.60  (4.0%)    1.9% ( -3% -  7%)  0.109
FilteredAndHighMed             42.42  (2.3%)    43.24  (3.4%)    1.9% ( -3% -  7%)  0.084
AndHighMed                     59.49  (6.4%)    60.73  (8.2%)    2.1% (-11% - 17%)  0.452
OrHighRare                    113.84  (5.1%)   116.29  (5.9%)    2.1% ( -8% - 13%)  0.303
And2Terms2StopWords            60.52  (4.4%)    62.04  (4.9%)    2.5% ( -6% - 12%)  0.154
DismaxOrHighHigh               37.38  (5.3%)    38.37  (5.7%)    2.7% ( -7% - 14%)  0.201
FilteredAnd2Terms2StopWords    61.89  (2.5%)    63.59  (3.8%)    2.7% ( -3% -  9%)  0.025
CombinedAndHighHigh             7.77  (1.9%)     7.99  (2.0%)    2.8% ( -1% -  6%)  0.000
CombinedOrHighHigh              7.66  (1.9%)     7.90  (2.3%)    3.1% ( -1% -  7%)  0.000
Or2Terms2StopWords             62.35  (4.3%)    64.34  (5.4%)    3.2% ( -6% - 13%)  0.086
AndHighHigh                    26.50  (9.3%)    28.10 (10.7%)    6.0% (-12% - 28%)  0.112
OrHighHigh                     26.64  (8.6%)    28.27 (10.4%)    6.1% (-11% - 27%)  0.089
Or3Terms                       64.65  (4.8%)    68.88  (6.5%)    6.5% ( -4% - 18%)  0.002
And3Terms                      69.90  (4.6%)    74.59  (6.7%)    6.7% ( -4% - 18%)  0.002
AndStopWords                   10.30  (5.8%)    11.89  (9.2%)   15.4% (  0% - 32%)  0.000
OrStopWords                    10.83  (7.6%)    12.57 (10.1%)   16.0% ( -1% - 36%)  0.000
```

Another thing I am not happy about is the word-level code duplicated with `FixedBitSet#forEach`, but I have not had a good idea for reusing it so far.
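For reference, the word-level pattern that ends up duplicated with `FixedBitSet#forEach` is essentially the classic lowest-set-bit loop (a sketch under my own naming, not the exact Lucene code): peel off the lowest set bit of each word with `numberOfTrailingZeros` until the word is exhausted.

```java
public class WordLoopSketch {
  /**
   * Writes the indices of set bits of {@code word} (offset by {@code base},
   * i.e. 64 * wordIndex) into {@code dest} starting at {@code offset};
   * returns the new write offset.
   */
  static int whileLoop(long word, int base, int[] dest, int offset) {
    while (word != 0L) {
      dest[offset++] = base + Long.numberOfTrailingZeros(word);
      word &= word - 1; // clear the lowest set bit
    }
    return offset;
  }

  public static void main(String[] args) {
    int[] dest = new int[8];
    // bits 0, 1 and 3 of the second 64-bit word
    int count = whileLoop(0b1011L, 64, dest, 0);
    System.out.println(count + " " + dest[0] + " " + dest[1] + " " + dest[2]);
    // prints "3 64 65 67"
  }
}
```

The duplication is awkward because `forEach` hands each index to a consumer while this patch writes into an array; the loop body differs only in that one line.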
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org