[ https://issues.apache.org/jira/browse/LUCENE-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand reopened LUCENE-8069:
----------------------------------

This idea has come back to mind several times since I opened this issue.
Sorting by norm brings the following benefits:
 - Better compression: smaller doc IDs are likely to have tiny term
frequencies, since the term frequency is usually less than or equal to the
norm.
 - Smaller impacts: since each block of postings has only one unique norm value
on average, it also has only one impact on average. This helps at search time,
since computing the score of this impact immediately gives us the best score
of the block, as opposed to having to iterate over several impacts and take
the highest score.
 - For term queries, it ensures that among all documents that have X
occurrences of the queried term, we visit the documents that have the lowest
norm first, and thus the ones that produce the best scores.
 - Boolean queries are interesting: they get the same benefit as term queries,
but on the other hand the norm tends to correlate with the number of unique
terms, so you might need to collect more matches before you find one that
matches several query terms.
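
To illustrate the point about impacts, here is a minimal Python sketch, not
Lucene code. It assumes an impact is a (freq, norm) pair, that a larger norm
means a longer field, and that scores increase with freq and decrease with
norm. The function names, the block size of 4 and the sample postings are all
made up for the example.

```python
def competitive_impacts(postings):
    """Return the (freq, norm) pairs of a block that no other pair dominates.

    Assumption: (f2, n2) dominates (f1, n1) when f2 >= f1 and n2 <= n1,
    i.e. scores grow with term frequency and shrink with norm (length).
    """
    best = []
    # Visit pairs by ascending norm, and by descending freq within a norm.
    for freq, norm in sorted(set(postings), key=lambda p: (p[1], -p[0])):
        # Keep a pair only if its freq beats every pair with a smaller norm.
        if not best or freq > best[-1][0]:
            best.append((freq, norm))
    return best


def blocks(postings, size):
    """Split a postings list into consecutive blocks of `size` entries."""
    return [postings[i:i + size] for i in range(0, len(postings), size)]


# Hypothetical postings for one term, as (freq, norm) pairs in doc ID order.
by_doc_id = [(1, 10), (3, 40), (2, 10), (1, 40),
             (1, 10), (4, 40), (2, 10), (2, 40)]

# The same postings once the index is sorted by norm (doc IDs reassigned).
by_norm = sorted(by_doc_id, key=lambda p: p[1])

unsorted_impacts = [competitive_impacts(b) for b in blocks(by_doc_id, 4)]
sorted_impacts = [competitive_impacts(b) for b in blocks(by_norm, 4)]

print(unsorted_impacts)  # [[(2, 10), (3, 40)], [(2, 10), (4, 40)]]
print(sorted_impacts)    # [[(2, 10)], [(4, 40)]]
```

In this toy data each block of the norm-sorted postings holds a single norm
value, so a single impact gives the best score of the block, while the
unsorted blocks each need two.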

I hacked a quick prototype and ran luceneutil on wikibig, results are 
encouraging:
{noformat}
                   Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
   HighTermDayOfYearSort       37.64      (6.4%)       33.96      (4.7%)   -9.8% ( -19% -    1%)
              HighPhrase       26.45      (2.7%)       25.24      (2.8%)   -4.6% (  -9% -    0%)
               OrHighLow      341.59      (2.8%)      327.84      (2.6%)   -4.0% (  -9% -    1%)
                  Fuzzy2      153.15      (5.3%)      147.70      (5.1%)   -3.6% ( -13% -    7%)
                  IntNRQ      151.43      (1.4%)      147.04      (3.4%)   -2.9% (  -7% -    1%)
       HighTermMonthSort       79.28      (6.4%)       79.44      (7.6%)    0.2% ( -12% -   15%)
                 Respell      229.10      (2.2%)      230.62      (1.8%)    0.7% (  -3% -    4%)
                  Fuzzy1      285.25      (6.9%)      288.99      (6.8%)    1.3% ( -11% -   16%)
                 Prefix3       34.60     (10.3%)       35.14     (10.6%)    1.6% ( -17% -   25%)
                Wildcard       72.36      (5.8%)       73.86      (6.3%)    2.1% (  -9% -   15%)
                 MedTerm     1895.68      (4.2%)     1939.92      (4.2%)    2.3% (  -5% -   11%)
            HighSpanNear        5.25      (6.0%)        5.46      (6.0%)    3.9% (  -7% -   17%)
         LowSloppyPhrase        6.85      (6.5%)        7.13      (6.3%)    4.2% (  -8% -   18%)
               LowPhrase       46.08      (1.7%)       48.56      (1.8%)    5.4% (   1% -    9%)
             LowSpanNear       24.03      (3.7%)       25.68      (4.3%)    6.9% (  -1% -   15%)
             MedSpanNear        5.20     (13.2%)        5.63     (15.2%)    8.3% ( -17% -   42%)
         MedSloppyPhrase       11.01      (4.5%)       11.95      (4.7%)    8.6% (   0% -   18%)
               MedPhrase       23.39      (2.6%)       25.64      (2.2%)    9.6% (   4% -   14%)
        HighSloppyPhrase        3.84      (5.9%)        4.26      (5.8%)   11.0% (   0% -   24%)
              AndHighLow      401.13      (3.4%)      458.11      (3.0%)   14.2% (   7% -   21%)
                 LowTerm     2294.98      (4.0%)     2863.59      (7.0%)   24.8% (  13% -   37%)
              AndHighMed       53.62      (3.8%)       71.40      (1.8%)   33.2% (  26% -   40%)
                HighTerm     1286.59      (3.9%)     1917.61      (5.7%)   49.0% (  38% -   60%)
             AndHighHigh       41.24      (3.5%)       69.17      (4.2%)   67.7% (  58% -   78%)
               OrHighMed       49.92      (2.4%)       84.95      (4.0%)   70.2% (  62% -   78%)
              OrHighHigh       43.55      (2.3%)       90.06      (4.8%)  106.8% (  97% -  116%)
{noformat}

The {{doc}} file is 12% smaller.
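
For readers of the table, the Pct diff column appears to be the relative
change in mean QPS between baseline and patch; the rows are consistent with
that reading. The short Python check below assumes that interpretation
(`pct_diff` is a name chosen for this example, not a luceneutil function) and
does not try to reproduce the range shown in parentheses.

```python
def pct_diff(baseline_qps, patch_qps):
    """Relative change in mean QPS, in percent, rounded to one decimal."""
    return round(100.0 * (patch_qps - baseline_qps) / baseline_qps, 1)


# (QPS baseline, QPS patch) copied from three rows of the table.
rows = {
    "HighTermDayOfYearSort": (37.64, 33.96),
    "HighTerm": (1286.59, 1917.61),
    "OrHighHigh": (43.55, 90.06),
}

for task, (baseline, patch) in rows.items():
    print(f"{task:>24} {pct_diff(baseline, patch):6.1f}%")
```

This gives -9.8%, 49.0% and 106.8%, matching the corresponding rows.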

> Allow index sorting by field length
> -----------------------------------
>
>                 Key: LUCENE-8069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8069
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Short documents are more likely to get higher scores, so sorting an index by 
> field length would mean we would be likely to collect the best matches first. 
> Depending on the similarity implementation, this might even allow early 
> termination of the collection of top documents on term queries.


