[
https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand reopened LUCENE-8069:
----------------------------------
This idea has come back to my mind several times since I opened this issue.
Sorting by norm brings the following benefits:
- Better compression: smaller doc IDs are likely to have tiny term
frequencies, since most of the time the term frequency is less than or equal
to the norm.
- Smaller impacts: since each block of postings has only one unique norm value
on average, it also has only one impact on average. This helps at search
time, since scoring that single impact immediately gives us the best score of
the block, as opposed to iterating over several impacts and taking the
highest score.
- For term queries, it ensures that among all documents that have X
occurrences of the queried term, we visit the documents that have the lowest
norm first, and thus the ones that produce the best scores.
- Boolean queries are interesting: they get the same benefit as term queries,
but on the other hand the norm tends to correlate with the number of unique
terms, so you might need to collect more matches before you find one that
matches several query terms.
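To illustrate the "smaller impacts" point, here is a rough Python sketch, not the prototype itself. It assumes synthetic postings for a single term, a block size of 128, and it treats the norm as the raw field length (lower is better). All names ({{competitive_impacts}}, {{simulate}}, etc.) are made up for the example. It counts how many non-dominated (freq, norm) pairs each block holds, with and without sorting by norm.

```python
import random

BLOCK_SIZE = 128  # assumed postings block size for this sketch


def competitive_impacts(block):
    """Return the (freq, norm) pairs of a block that no other pair dominates.

    A pair dominates another one when its term frequency is at least as high
    and its norm (here: field length) is at most as large, so it scores at
    least as well under any similarity that grows with freq and shrinks with
    length.
    """
    best_freq_per_norm = {}
    for freq, norm in block:
        best_freq_per_norm[norm] = max(freq, best_freq_per_norm.get(norm, 0))
    impacts = []
    max_freq = 0
    # Walk norms from best (shortest) to worst: a pair is only useful if its
    # freq beats every freq seen at a shorter length.
    for norm in sorted(best_freq_per_norm):
        if best_freq_per_norm[norm] > max_freq:
            max_freq = best_freq_per_norm[norm]
            impacts.append((max_freq, norm))
    return impacts


def average_impacts_per_block(postings):
    """Split postings into fixed-size blocks and average the impact counts."""
    blocks = [postings[i:i + BLOCK_SIZE]
              for i in range(0, len(postings), BLOCK_SIZE)]
    return sum(len(competitive_impacts(b)) for b in blocks) / len(blocks)


def simulate(num_docs=20000, seed=42):
    """Compare impacts per block for unsorted vs. norm-sorted doc ID order."""
    rng = random.Random(seed)
    postings = []
    for _ in range(num_docs):
        norm = rng.randint(1, 200)            # synthetic field length
        freq = rng.randint(1, min(norm, 10))  # freq never exceeds the length
        postings.append((freq, norm))
    unsorted_avg = average_impacts_per_block(postings)
    # Sorting the index by norm means doc IDs, and thus postings, follow norm order.
    sorted_avg = average_impacts_per_block(sorted(postings, key=lambda p: p[1]))
    return unsorted_avg, sorted_avg
```

Calling {{simulate()}} returns the average number of impacts per block for both orders. On this synthetic data the sorted order needs fewer impacts per block, which is the effect described above. Real postings will of course behave differently from uniform random data.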
I hacked a quick prototype and ran luceneutil on wikibig. The results are
encouraging:
{noformat}
Task                  QPS baseline   StdDev  QPS patch   StdDev  Pct diff
HighTermDayOfYearSort        37.64   (6.4%)      33.96   (4.7%)     -9.8% ( -19% -   1%)
HighPhrase                   26.45   (2.7%)      25.24   (2.8%)     -4.6% (  -9% -   0%)
OrHighLow                   341.59   (2.8%)     327.84   (2.6%)     -4.0% (  -9% -   1%)
Fuzzy2                      153.15   (5.3%)     147.70   (5.1%)     -3.6% ( -13% -   7%)
IntNRQ                      151.43   (1.4%)     147.04   (3.4%)     -2.9% (  -7% -   1%)
HighTermMonthSort            79.28   (6.4%)      79.44   (7.6%)      0.2% ( -12% -  15%)
Respell                     229.10   (2.2%)     230.62   (1.8%)      0.7% (  -3% -   4%)
Fuzzy1                      285.25   (6.9%)     288.99   (6.8%)      1.3% ( -11% -  16%)
Prefix3                      34.60  (10.3%)      35.14  (10.6%)      1.6% ( -17% -  25%)
Wildcard                     72.36   (5.8%)      73.86   (6.3%)      2.1% (  -9% -  15%)
MedTerm                    1895.68   (4.2%)    1939.92   (4.2%)      2.3% (  -5% -  11%)
HighSpanNear                  5.25   (6.0%)       5.46   (6.0%)      3.9% (  -7% -  17%)
LowSloppyPhrase               6.85   (6.5%)       7.13   (6.3%)      4.2% (  -8% -  18%)
LowPhrase                    46.08   (1.7%)      48.56   (1.8%)      5.4% (   1% -   9%)
LowSpanNear                  24.03   (3.7%)      25.68   (4.3%)      6.9% (  -1% -  15%)
MedSpanNear                   5.20  (13.2%)       5.63  (15.2%)      8.3% ( -17% -  42%)
MedSloppyPhrase              11.01   (4.5%)      11.95   (4.7%)      8.6% (   0% -  18%)
MedPhrase                    23.39   (2.6%)      25.64   (2.2%)      9.6% (   4% -  14%)
HighSloppyPhrase              3.84   (5.9%)       4.26   (5.8%)     11.0% (   0% -  24%)
AndHighLow                  401.13   (3.4%)     458.11   (3.0%)     14.2% (   7% -  21%)
LowTerm                    2294.98   (4.0%)    2863.59   (7.0%)     24.8% (  13% -  37%)
AndHighMed                   53.62   (3.8%)      71.40   (1.8%)     33.2% (  26% -  40%)
HighTerm                   1286.59   (3.9%)    1917.61   (5.7%)     49.0% (  38% -  60%)
AndHighHigh                  41.24   (3.5%)      69.17   (4.2%)     67.7% (  58% -  78%)
OrHighMed                    49.92   (2.4%)      84.95   (4.0%)     70.2% (  62% -  78%)
OrHighHigh                   43.55   (2.3%)      90.06   (4.8%)    106.8% (  97% - 116%)
{noformat}
The {{doc}} file is 12% smaller.
> Allow index sorting by field length
> -----------------------------------
>
> Key: LUCENE-8069
> URL: https://issues.apache.org/jira/browse/LUCENE-8069
> Project: Lucene - Core
> Issue Type: Wish
> Reporter: Adrien Grand
> Priority: Minor
>
> Short documents are more likely to get higher scores, so sorting an index by
> field length would mean we would be likely to collect the best matches first.
> Depending on the similarity implementation, this might even allow early
> termination of the collection of top documents on term queries.
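
The early-termination idea for term queries can be sketched as follows. This is only an illustration under assumptions that are not in the issue: a simplified BM25-like score that increases with term frequency and decreases with field length, a known maximum term frequency for the queried term ({{max_freq}}), and postings that already follow increasing field length. All function names are hypothetical.

```python
import heapq
import random


def bm25_like(freq, length, avg_length=100.0, k1=1.2, b=0.75):
    """Simplified BM25 term-frequency part (IDF omitted: constant per term)."""
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * length / avg_length))


def top_k_term_query(postings, k, max_freq, early_terminate=True):
    """Collect the top-k hits of a term query.

    postings: (doc_id, freq, length) tuples in doc ID order; since the index
    is assumed to be sorted by field length, length never decreases.
    Returns (hits, visited), hits being (doc_id, score) pairs, best first.
    """
    heap = []  # min-heap of (score, -doc_id): the root is the weakest kept hit
    visited = 0
    for doc_id, freq, length in postings:
        # No remaining doc can beat the best possible freq at the current
        # length, so stop once that bound cannot improve the top-k.
        if (early_terminate and len(heap) == k
                and bm25_like(max_freq, length) <= heap[0][0]):
            break
        visited += 1
        score = bm25_like(freq, length)
        if len(heap) < k:
            heapq.heappush(heap, (score, -doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, -doc_id))
    hits = sorted(((-neg_id, score) for score, neg_id in heap),
                  key=lambda hit: (-hit[1], hit[0]))
    return hits, visited


def build_postings(num_docs=5000, seed=7):
    """Synthetic postings whose doc IDs follow increasing field length."""
    rng = random.Random(seed)
    lengths = sorted(rng.randint(5, 500) for _ in range(num_docs))
    return [(doc_id, rng.randint(1, 5), length)
            for doc_id, length in enumerate(lengths)]
```

Running {{top_k_term_query(build_postings(), 10, max_freq=5)}} with and without {{early_terminate}} returns the same hits, while the early-terminating run stops as soon as the score bound can no longer improve the top-k. Whether such a bound exists depends on the similarity, as noted in the description.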