[jira] [Commented] (LUCENE-8311) Leverage impacts for phrase queries

Adrien Grand (JIRA) Tue, 22 May 2018 02:58:30 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483737#comment-16483737
 ]


Adrien Grand commented on LUCENE-8311:
--------------------------------------

Here is a patch that builds on LUCENE-8312 and the output of a luceneutil run:

{noformat}
               LowPhrase       23.35      (2.1%)       16.05      (1.1%)  -31.3% ( -33% -  -28%)
        HighSloppyPhrase       26.90      (5.1%)       23.84      (3.8%)  -11.4% ( -19% -   -2%)
       HighTermMonthSort      155.27     (13.1%)      138.14     (11.0%)  -11.0% ( -31% -   15%)
         MedSloppyPhrase       18.12      (4.6%)       16.20      (3.2%)  -10.6% ( -17% -   -2%)
         LowSloppyPhrase      236.36      (5.4%)      218.12      (4.5%)   -7.7% ( -16% -    2%)
   HighTermDayOfYearSort       89.47     (11.5%)       84.16     (10.1%)   -5.9% ( -24% -   17%)
                HighTerm     1463.31      (3.9%)     1402.12      (3.4%)   -4.2% ( -11% -    3%)
                  IntNRQ       29.88      (6.8%)       28.65      (6.8%)   -4.1% ( -16% -   10%)
                 MedTerm     1721.26      (3.8%)     1672.73      (3.2%)   -2.8% (  -9% -    4%)
                  Fuzzy2      112.51      (5.1%)      109.41      (4.9%)   -2.8% ( -12% -    7%)
                 LowTerm     2469.28      (3.8%)     2414.68      (3.5%)   -2.2% (  -9% -    5%)
             MedSpanNear       85.48      (4.1%)       84.02      (3.9%)   -1.7% (  -9% -    6%)
            HighSpanNear       10.03      (4.4%)        9.86      (4.1%)   -1.7% (  -9% -    7%)
                  Fuzzy1      153.76      (4.9%)      151.56      (4.0%)   -1.4% (  -9% -    7%)
              OrHighHigh       20.38      (3.2%)       20.18      (3.0%)   -1.0% (  -6% -    5%)
               OrHighMed       72.71      (2.5%)       72.05      (2.4%)   -0.9% (  -5% -    4%)
                 Respell      163.99      (2.1%)      162.75      (2.3%)   -0.8% (  -5% -    3%)
                Wildcard       39.17      (5.7%)       38.90      (5.0%)   -0.7% ( -10% -   10%)
                 Prefix3       45.93      (7.2%)       45.72      (6.6%)   -0.5% ( -13% -   14%)
              AndHighMed      147.08      (2.0%)      146.55      (3.1%)   -0.4% (  -5% -    4%)
             AndHighHigh       52.33      (2.0%)       52.25      (3.6%)   -0.2% (  -5% -    5%)
               OrHighLow      331.39      (3.4%)      334.43      (2.5%)    0.9% (  -4% -    7%)
              AndHighLow      603.54      (3.6%)      611.77      (3.8%)    1.4% (  -5% -    9%)
             LowSpanNear        7.87     (11.1%)        8.04      (6.9%)    2.2% ( -14% -   22%)
               MedPhrase       94.59      (1.6%)      108.41      (1.9%)   14.6% (  10% -   18%)
              HighPhrase       11.74      (2.8%)      109.04     (24.6%)  828.7% ( 779% -  880%)
{noformat}

It helps HighPhrase a lot, but hurts LowPhrase a bit. More generally, this 
change helps most when at least one of the searched terms mostly occurs within 
the phrase. For instance, "york" mostly appears in the "new york" phrase in the 
Wikipedia corpus that we use, so the "new york" phrase gets a huge speedup. 
This is not the case for LowPhrase entries like "median age" or "his family", 
which get worse latencies because they need to read impacts from the index and 
compute score upper bounds.

I tried to implement impacts on sloppy phrases by summing up frequencies, but it 
didn't help since the score upper bounds were much higher than the scores that 
were actually computed. The reason why sloppy phrases are slower according to 
luceneutil is that the refactoring made them use impacts enums rather than 
simple postings enums to iterate doc ids.
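
For readers who want to see the exact-phrase idea from the issue description 
(take the minimum term frequency for each unique norm value) in concrete form, 
here is a minimal standalone sketch. It is not the code from the patch: the 
Impact class below is a hypothetical stand-in rather than Lucene's own class, 
the method names are made up for illustration, and norms are assumed to compare 
as plain signed longs where a lower norm means a shorter document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch only; names and types here are assumptions, not Lucene's API.
final class PhraseImpactsSketch {

  /** Hypothetical (freq, norm) pair standing in for a per-term impact. */
  public static final class Impact {
    public final int freq;
    public final long norm;

    public Impact(int freq, long norm) {
      this.freq = freq;
      this.norm = norm;
    }

    @Override
    public String toString() {
      return "{freq=" + freq + ", norm=" + norm + "}";
    }
  }

  /** Largest freq among impacts whose norm is <= the given norm, or 0 if there is none. */
  public static int maxFreqForNorm(List<Impact> impacts, long norm) {
    int freq = 0;
    for (Impact impact : impacts) {
      if (impact.norm <= norm) {
        freq = Math.max(freq, impact.freq);
      }
    }
    return freq;
  }

  /**
   * For each unique norm, the phrase cannot occur more often than its rarest
   * term, so the minimum of the per-term frequency bounds is kept.
   */
  public static List<Impact> mergeForExactPhrase(List<List<Impact>> perTerm) {
    TreeSet<Long> norms = new TreeSet<>();
    for (List<Impact> impacts : perTerm) {
      for (Impact impact : impacts) {
        norms.add(impact.norm);
      }
    }
    List<Impact> merged = new ArrayList<>();
    int previousFreq = 0;
    for (long norm : norms) { // ascending norm order
      int freq = Integer.MAX_VALUE;
      for (List<Impact> impacts : perTerm) {
        freq = Math.min(freq, maxFreqForNorm(impacts, norm));
      }
      // Keep only entries that raise the frequency bound; a higher norm with
      // the same bound can never score better than an entry already kept.
      if (freq > previousFreq) {
        merged.add(new Impact(freq, norm));
        previousFreq = freq;
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List<Impact> first = Arrays.asList(new Impact(2, 4), new Impact(5, 10), new Impact(9, 30));
    List<Impact> second = Arrays.asList(new Impact(1, 6), new Impact(3, 12));
    // prints [{freq=1, norm=6}, {freq=3, norm=12}]
    System.out.println(mergeForExactPhrase(Arrays.asList(first, second)));
  }
}
```

In this made-up example the merged bounds are driven entirely by the rarer 
second term, which is the situation where the upper bounds are tight enough to 
skip blocks. Replacing the minimum with a sum, as tried for sloppy phrases, 
would give much looser bounds on the same input.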

> Leverage impacts for phrase queries
> -----------------------------------
>
>                 Key: LUCENE-8311
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8311
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Now that we expose raw impacts, we could leverage them for phrase queries.
> For instance for exact phrases, we could take the minimum term frequency for 
> each unique norm value in order to get upper bounds of the score for the 
> phrase.



