[
https://issues.apache.org/jira/browse/LUCENE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483737#comment-16483737
]
Adrien Grand commented on LUCENE-8311:
--------------------------------------
Here is a patch that builds on LUCENE-8312 and the output of a luceneutil run:
{noformat}
LowPhrase                 23.35   (2.1%)    16.05   (1.1%)   -31.3%  ( -33% - -28%)
HighSloppyPhrase          26.90   (5.1%)    23.84   (3.8%)   -11.4%  ( -19% - -2%)
HighTermMonthSort        155.27  (13.1%)   138.14  (11.0%)   -11.0%  ( -31% - 15%)
MedSloppyPhrase           18.12   (4.6%)    16.20   (3.2%)   -10.6%  ( -17% - -2%)
LowSloppyPhrase          236.36   (5.4%)   218.12   (4.5%)    -7.7%  ( -16% - 2%)
HighTermDayOfYearSort     89.47  (11.5%)    84.16  (10.1%)    -5.9%  ( -24% - 17%)
HighTerm                1463.31   (3.9%)  1402.12   (3.4%)    -4.2%  ( -11% - 3%)
IntNRQ                    29.88   (6.8%)    28.65   (6.8%)    -4.1%  ( -16% - 10%)
MedTerm                 1721.26   (3.8%)  1672.73   (3.2%)    -2.8%  ( -9% - 4%)
Fuzzy2                   112.51   (5.1%)   109.41   (4.9%)    -2.8%  ( -12% - 7%)
LowTerm                 2469.28   (3.8%)  2414.68   (3.5%)    -2.2%  ( -9% - 5%)
MedSpanNear               85.48   (4.1%)    84.02   (3.9%)    -1.7%  ( -9% - 6%)
HighSpanNear              10.03   (4.4%)     9.86   (4.1%)    -1.7%  ( -9% - 7%)
Fuzzy1                   153.76   (4.9%)   151.56   (4.0%)    -1.4%  ( -9% - 7%)
OrHighHigh                20.38   (3.2%)    20.18   (3.0%)    -1.0%  ( -6% - 5%)
OrHighMed                 72.71   (2.5%)    72.05   (2.4%)    -0.9%  ( -5% - 4%)
Respell                  163.99   (2.1%)   162.75   (2.3%)    -0.8%  ( -5% - 3%)
Wildcard                  39.17   (5.7%)    38.90   (5.0%)    -0.7%  ( -10% - 10%)
Prefix3                   45.93   (7.2%)    45.72   (6.6%)    -0.5%  ( -13% - 14%)
AndHighMed               147.08   (2.0%)   146.55   (3.1%)    -0.4%  ( -5% - 4%)
AndHighHigh               52.33   (2.0%)    52.25   (3.6%)    -0.2%  ( -5% - 5%)
OrHighLow                331.39   (3.4%)   334.43   (2.5%)     0.9%  ( -4% - 7%)
AndHighLow               603.54   (3.6%)   611.77   (3.8%)     1.4%  ( -5% - 9%)
LowSpanNear                7.87  (11.1%)     8.04   (6.9%)     2.2%  ( -14% - 22%)
MedPhrase                 94.59   (1.6%)   108.41   (1.9%)    14.6%  ( 10% - 18%)
HighPhrase                11.74   (2.8%)   109.04  (24.6%)   828.7%  ( 779% - 880%)
{noformat}
It helps HighPhrase a lot (+828.7%) and MedPhrase somewhat (+14.6%), but hurts
LowPhrase (-31.3%). More generally, this change helps most when at least one of
the searched terms mostly occurs within the phrase. For instance, "york" mostly
appears in the phrase "new york" in the Wikipedia corpus that we use, so the
"new york" phrase gets a huge speedup. This is not the case for LowPhrase
entries like "median age" or "his family", which get slower because they need
to read impacts from the index and compute score upper bounds.
I tried to implement impacts on sloppy phrases by summing up frequencies, but
it didn't help since the score upper bounds were way higher than the scores
that were actually computed. Sloppy phrases are still slower according to
luceneutil because the refactoring made them use impacts enums rather than
simple postings enums to iterate doc IDs.
> Leverage impacts for phrase queries
> -----------------------------------
>
> Key: LUCENE-8311
> URL: https://issues.apache.org/jira/browse/LUCENE-8311
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
>
> Now that we expose raw impacts, we could leverage them for phrase queries.
> For instance for exact phrases, we could take the minimum term frequency for
> each unique norm value in order to get upper bounds of the score for the
> phrase.
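
The idea in the issue description (take the minimum term frequency for each
unique norm value) can be sketched roughly as below. This is a simplified,
standalone illustration and not the code from the patch: the `Impact` class,
`maxFreqForNorm` and `mergeForExactPhrase` are hypothetical names, and the
sketch assumes that norms are non-negative and that an impact `(freq, norm)`
covers every document whose term frequency is at most `freq` and whose norm is
at least `norm`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class PhraseImpactsSketch {

  /** Hypothetical stand-in for a (freq, norm) pair; not Lucene's own class. */
  static final class Impact {
    final int freq;
    final long norm;

    Impact(int freq, long norm) {
      this.freq = freq;
      this.norm = norm;
    }
  }

  /**
   * Largest frequency one term may have in a document with the given norm,
   * or 0 if no impact of that term covers this norm.
   */
  static int maxFreqForNorm(List<Impact> impacts, long norm) {
    int freq = 0;
    for (Impact impact : impacts) {
      if (impact.norm <= norm) {
        freq = Math.max(freq, impact.freq);
      }
    }
    return freq;
  }

  /**
   * Merges per-term impacts into upper bounds for an exact phrase: a phrase
   * cannot occur more often than its rarest term, so for each unique norm we
   * keep the minimum of the per-term frequency bounds.
   */
  static List<Impact> mergeForExactPhrase(List<List<Impact>> perTerm) {
    // Collect every norm value seen in any term's impacts, in ascending order.
    TreeSet<Long> norms = new TreeSet<>();
    for (List<Impact> impacts : perTerm) {
      for (Impact impact : impacts) {
        norms.add(impact.norm);
      }
    }

    List<Impact> merged = new ArrayList<>();
    for (long norm : norms) {
      int minFreq = Integer.MAX_VALUE;
      for (List<Impact> impacts : perTerm) {
        minFreq = Math.min(minFreq, maxFreqForNorm(impacts, norm));
      }
      if (minFreq == 0) {
        continue; // some term has no document with this norm
      }
      // Keep only entries that raise the bound; a lower norm with the same
      // frequency already covers this one.
      if (merged.isEmpty() || minFreq > merged.get(merged.size() - 1).freq) {
        merged.add(new Impact(minFreq, norm));
      }
    }
    return merged;
  }
}
```

With made-up impacts `(3, 5), (7, 10)` for one term and `(2, 4), (4, 8)` for
the other, the merged list is `(2, 5), (3, 8), (4, 10)`. Each merged pair could
then be fed to the similarity to obtain a score upper bound for the phrase.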