Baffling performance regression measured by luceneutil

Greg Miller Thu, 15 Aug 2024 16:22:48 -0700

Hi folks-

Egor Potemkin and I have been digging into a baffling performance
regression caused by a one-line change that, rationally, shouldn't have
any performance impact whatsoever. There's more background on why we're
trying to understand this, but I'll save the broader context for now and
just focus on the confusing issue itself.


Inside IndexSearcher, we've staged a change that initializes an ArrayList
of Collectors slightly earlier than we do today (see:
https://github.com/apache/lucene/pull/13657/files). We end up with code
that looks like this (note the isolated line that initializes
`collectors`):

```
  public <C extends Collector, T> T search(Query query, CollectorManager<C, T> collectorManager)
      throws IOException {
    final LeafSlice[] leafSlices = getSlices();
    final C firstCollector = collectorManager.newCollector();
    query = rewrite(query, firstCollector.scoreMode().needsScores());
    final Weight weight = createWeight(query, firstCollector.scoreMode(), 1);

    // Allocated after createWeight(...): no measurable impact.
    // Moving only this line above createWeight(...) triggers the regression.
    final List<C> collectors = new ArrayList<>(leafSlices.length);

    return search(weight, collectorManager, firstCollector, collectors, leafSlices);
  }
```

What's baffling is that if we initialize the `collectors` list _after_ the
call to `createWeight` (as shown here), there's no performance impact at
all (as expected). But if the only change is to initialize `collectors`
_before_ the call to `createWeight`, we see a very significant regression
on the LowTerm, MedTerm, and HighTerm tasks in luceneutil (roughly
15% - 30%). At the other end, we see a significant improvement on
OrHighNotLow, OrHighNotMed, and OrHighNotHigh (roughly 7% - 15%). (This is
running wikimedium10m on an x86-based AWS EC2 host, but the results
reproduced separately for Egor and in our nightly benchmark runs; the full
luceneutil output is at the bottom of this email [1].) Some additional
context and conversation is captured in this "demo" PR:
https://github.com/apache/lucene/pull/13657.

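As a quick sanity check on how to read the table in [1], the Pct diff
column appears consistent with the plain relative change in mean QPS,
(modified - baseline) / baseline. The sketch below uses a hypothetical
`PctDiffCheck` class (not part of luceneutil) to recompute that ratio for
the six tasks called out above, using QPS values copied from [1]; it
deliberately ignores the StdDev columns and the parenthesized ranges.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: recomputes the "Pct diff" column from mean QPS values. */
public class PctDiffCheck {

    /** One row of the table in [1]: task name, mean QPS for each side, reported p-value. */
    record Row(String task, double baselineQps, double modifiedQps, double pValue) {}

    /** Relative change in mean QPS, in percent: (modified - baseline) / baseline * 100. */
    static double pctDiff(Row r) {
        return (r.modifiedQps() - r.baselineQps()) / r.baselineQps() * 100.0;
    }

    public static void main(String[] args) {
        // QPS values and p-values copied from the luceneutil output in [1].
        List<Row> rows = List.of(
                new Row("MedTerm", 513.21, 369.43, 0.000),
                new Row("HighTerm", 523.20, 402.11, 0.000),
                new Row("LowTerm", 837.70, 715.94, 0.000),
                new Row("OrHighNotLow", 446.78, 479.82, 0.003),
                new Row("OrHighNotMed", 591.66, 649.35, 0.000),
                new Row("OrHighNotHigh", 339.78, 393.02, 0.000));

        for (Row r : rows) {
            // Locale.ROOT keeps '.' as the decimal separator regardless of the JVM's default locale.
            System.out.printf(Locale.ROOT, "%-14s %6.1f%%  p=%.3f%n", r.task(), pctDiff(r), r.pValue());
        }
    }
}
```

Run as-is, it prints -28.0%, -23.1%, -14.5%, 7.4%, 9.8% and 15.7%, which
match the Pct diff values luceneutil reported for those tasks.
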
My only hunch is that this has something to do with HotSpot's
decision-making or some other runtime optimization, but I'm getting out of
my depth and hoping someone in this community will have ideas on ways to
continue this investigation. Does anyone have a clue what might be going
on, or suggestions for other things to look at? For what it's worth, this
isn't a purely academic exercise. This oddity has caused us to duplicate
some code in IndexSearcher to work with a new sandbox faceting module, so
it would be nice to figure this out so we can remove the duplication. (The
duplication is pretty minor, but it's still really frustrating, and it's a
trap waiting for anyone in the future who tries to consolidate the
duplicated code and runs into this.)

Thanks for reading, and thanks in advance for any ideas!

Cheers,
-Greg


[1] Full luceneutil output:
```
                        Task  QPS baseline   StdDev  QPS my_modified_version   StdDev                Pct diff p-value
                     MedTerm        513.21   (4.9%)                   369.43   (4.8%)  -28.0% ( -35% -  -19%)   0.000
                    HighTerm        523.20   (6.9%)                   402.11   (5.0%)  -23.1% ( -32% -  -12%)   0.000
                     LowTerm        837.70   (3.9%)                   715.94   (3.9%)  -14.5% ( -21% -   -6%)   0.000
   BrowseDayOfYearSSDVFacets         11.97  (18.9%)                    11.31  (11.9%)   -5.5% ( -30% -   31%)   0.273
        MedTermDayTaxoFacets         23.03   (4.9%)                    21.95   (6.4%)   -4.7% ( -15% -    6%)   0.009
                  HighPhrase        143.93   (8.3%)                   139.35   (4.7%)   -3.2% ( -14% -   10%)   0.136
                      Fuzzy2         53.03   (9.0%)                    51.50   (7.3%)   -2.9% ( -17% -   14%)   0.265
                 MedSpanNear         50.70   (5.1%)                    49.26   (3.0%)   -2.8% ( -10% -    5%)   0.032
                   LowPhrase         70.38   (4.9%)                    68.60   (5.3%)   -2.5% ( -12% -    8%)   0.118
                   MedPhrase         88.15   (5.2%)                    86.03   (4.2%)   -2.4% ( -11% -    7%)   0.105
      OrHighMedDayTaxoFacets          7.01   (5.5%)                     6.86   (5.4%)   -2.0% ( -12% -    9%)   0.237
                HighSpanNear         28.95   (2.7%)                    28.42   (2.9%)   -1.8% (  -7% -    3%)   0.043
             MedSloppyPhrase        201.71   (3.3%)                   198.58   (3.1%)   -1.6% (  -7% -    4%)   0.124
        BrowseDateTaxoFacets         23.97  (28.7%)                    23.62  (22.8%)   -1.5% ( -41% -   70%)   0.858
     AndHighMedDayTaxoFacets         32.81   (5.8%)                    32.35   (7.1%)   -1.4% ( -13% -   12%)   0.493
    AndHighHighDayTaxoFacets         27.86   (6.1%)                    27.50   (6.5%)   -1.3% ( -13% -   12%)   0.507
             LowSloppyPhrase        149.20   (2.9%)                   147.50   (3.0%)   -1.1% (  -6% -    4%)   0.227
        HighTermTitleBDVSort         66.72   (6.6%)                    66.04   (5.7%)   -1.0% ( -12% -   12%)   0.604
                 AndHighHigh        187.45   (7.4%)                   185.75   (6.7%)   -0.9% ( -13% -   14%)   0.684
                 LowSpanNear        102.21   (2.1%)                   101.50   (1.5%)   -0.7% (  -4% -    3%)   0.242
                  OrHighHigh        218.06   (6.3%)                   216.74   (4.1%)   -0.6% ( -10% -   10%)   0.721
           HighTermTitleSort        132.14   (1.5%)                   131.93   (1.3%)   -0.2% (  -2% -    2%)   0.724
            HighSloppyPhrase         31.43   (5.4%)                    31.39   (6.6%)   -0.1% ( -11% -   12%)   0.949
 BrowseRandomLabelSSDVFacets          7.91  (10.2%)                     7.91  (11.2%)   -0.0% ( -19% -   23%)   0.992
                  AndHighMed        288.24   (4.9%)                   288.33   (4.0%)    0.0% (  -8% -    9%)   0.982
                  AndHighLow       1339.09   (3.2%)                  1345.87   (4.8%)    0.5% (  -7% -    8%)   0.694
                   OrHighMed        473.22   (3.9%)                   476.21   (3.8%)    0.6% (  -6% -    8%)   0.603
   BrowseDayOfYearTaxoFacets         23.67  (28.7%)                    23.82  (23.5%)    0.6% ( -40% -   74%)   0.938
       HighTermDayOfYearSort        415.29   (5.2%)                   418.26   (5.9%)    0.7% (  -9% -   12%)   0.684
        BrowseDateSSDVFacets          2.14  (21.4%)                     2.16  (22.4%)    1.0% ( -35% -   56%)   0.887
                    Wildcard        489.21   (4.3%)                   494.69   (4.5%)    1.1% (  -7% -   10%)   0.420
                  TermDTSort        216.56   (5.9%)                   219.04   (4.8%)    1.1% (  -8% -   12%)   0.499
                    PKLookup        139.24   (8.7%)                   140.89  (10.8%)    1.2% ( -16% -   22%)   0.703
                      Fuzzy1         74.44   (9.7%)                    75.42   (8.3%)    1.3% ( -15% -   21%)   0.643
                     Respell         48.52   (7.2%)                    49.20   (6.6%)    1.4% ( -11% -   16%)   0.519
                OrNotHighLow       1260.39   (3.0%)                  1279.03   (2.7%)    1.5% (  -4% -    7%)   0.101
         MedIntervalsOrdered        132.03   (9.2%)                   134.25  (12.6%)    1.7% ( -18% -   25%)   0.630
       BrowseMonthTaxoFacets         24.51  (26.9%)                    25.02  (25.5%)    2.0% ( -39% -   74%)   0.804
           HighTermMonthSort       1117.15   (4.1%)                  1143.38   (4.6%)    2.3% (  -6% -   11%)   0.090
 BrowseRandomLabelTaxoFacets         15.54  (25.0%)                    15.93  (19.7%)    2.5% ( -33% -   62%)   0.724
                     Prefix3        667.73  (11.1%)                   684.51  (11.1%)    2.5% ( -17% -   27%)   0.474
         LowIntervalsOrdered        118.38  (14.5%)                   121.55  (14.8%)    2.7% ( -23% -   37%)   0.564
        HighIntervalsOrdered         30.52   (9.2%)                    31.34   (7.0%)    2.7% ( -12% -   20%)   0.298
                OrNotHighMed        365.66   (5.9%)                   376.73   (6.1%)    3.0% (  -8% -   15%)   0.110
                   OrHighLow        586.67   (5.7%)                   608.48   (5.6%)    3.7% (  -7% -   15%)   0.037
               OrNotHighHigh        257.09   (5.8%)                   267.66   (6.5%)    4.1% (  -7% -   17%)   0.034
       BrowseMonthSSDVFacets         11.21   (9.1%)                    11.69   (7.1%)    4.3% ( -11% -   22%)   0.100
                OrHighNotLow        446.78   (8.7%)                   479.82   (7.1%)    7.4% (  -7% -   25%)   0.003
                OrHighNotMed        591.66   (7.6%)                   649.35   (4.8%)    9.8% (  -2% -   23%)   0.000
                      IntNRQ        202.12  (17.5%)                   224.77  (28.1%)   11.2% ( -29% -   68%)   0.130
               OrHighNotHigh        339.78   (8.0%)                   393.02   (6.7%)   15.7% (   0% -   33%)   0.000
```
