[jira] [Commented] (LUCENE-5809) Simplify ExactPhraseScorer

Michael McCandless (JIRA) Wed, 09 Jul 2014 15:23:23 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056852#comment-14056852
 ]


Michael McCandless commented on LUCENE-5809:
--------------------------------------------

This patch makes ExactPhraseScorer MUCH simpler; I like it.

I tested on wikimedium (~1 KB sized docs):
{noformat}
Report after iter 19:
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
            OrHighNotMed       35.53     (14.2%)       33.51     (17.4%)   -5.7% ( -32% -   30%)
              HighPhrase        4.24     (11.9%)        4.01     (14.0%)   -5.4% ( -27% -   23%)
        HighSloppyPhrase        3.38     (13.7%)        3.24     (15.1%)   -4.3% ( -29% -   28%)
               MedPhrase      187.91     (16.4%)      180.25     (14.5%)   -4.1% ( -30% -   32%)
         LowSloppyPhrase       41.14     (16.1%)       39.58     (17.9%)   -3.8% ( -32% -   36%)
               LowPhrase       13.00      (8.1%)       12.63     (15.1%)   -2.8% ( -24% -   22%)
            HighSpanNear        8.89     (18.6%)        8.67     (23.6%)   -2.5% ( -37% -   48%)
             MedSpanNear       30.90     (13.7%)       30.16     (18.1%)   -2.4% ( -30% -   34%)
               OrHighMed       30.97     (14.8%)       30.24     (17.0%)   -2.4% ( -29% -   34%)
                 LowTerm      312.88     (18.0%)      306.15     (19.7%)   -2.1% ( -33% -   43%)
                  Fuzzy2       40.30     (13.6%)       39.61     (15.8%)   -1.7% ( -27% -   32%)
                 MedTerm      102.62     (16.5%)      101.02     (19.4%)   -1.6% ( -32% -   41%)
            OrNotHighMed       22.54     (14.2%)       22.20     (15.0%)   -1.5% ( -26% -   32%)
                  Fuzzy1       53.81     (14.3%)       53.00     (15.0%)   -1.5% ( -26% -   32%)
           OrNotHighHigh       10.62     (12.9%)       10.57     (11.2%)   -0.5% ( -21% -   27%)
                HighTerm       66.04     (24.3%)       65.94     (20.8%)   -0.1% ( -36% -   59%)
                  IntNRQ        3.09     (19.1%)        3.08     (16.8%)   -0.1% ( -30% -   44%)
                 Prefix3       84.61     (12.9%)       84.52     (14.2%)   -0.1% ( -24% -   31%)
         MedSloppyPhrase        3.22     (16.4%)        3.23     (17.5%)    0.0% ( -29% -   40%)
               OrHighLow       22.09     (16.4%)       22.21     (13.1%)    0.5% ( -24% -   35%)
             LowSpanNear       10.10     (17.7%)       10.20     (16.8%)    1.0% ( -28% -   43%)
              AndHighMed       32.50     (12.6%)       32.92     (10.9%)    1.3% ( -19% -   28%)
                 Respell       44.07     (13.4%)       44.85     (14.4%)    1.8% ( -22% -   34%)
           OrHighNotHigh       13.10     (12.4%)       13.42     (11.5%)    2.4% ( -19% -   30%)
            OrHighNotLow       27.70     (18.9%)       28.42     (18.3%)    2.6% ( -29% -   49%)
              AndHighLow      335.76     (17.4%)      344.72     (16.4%)    2.7% ( -26% -   44%)
             AndHighHigh       26.54     (12.7%)       27.48     (10.2%)    3.5% ( -17% -   30%)
            OrNotHighLow       22.53     (19.2%)       23.36     (14.5%)    3.7% ( -25% -   46%)
              OrHighHigh        9.62     (14.2%)       10.04     (11.1%)    4.3% ( -18% -   34%)
                Wildcard       17.96     (18.3%)       18.92     (13.4%)    5.3% ( -22% -   45%)
{noformat}

And also on wikibig (= full-sized docs, average is ~4 KB):

{noformat}
Report after iter 19:
                    Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
             AndHighHigh      418.84     (11.5%)      401.17     (16.9%)   -4.2% ( -29% -   27%)
           OrNotHighHigh       98.86     (12.4%)       95.04     (14.1%)   -3.9% ( -27% -   25%)
                 Respell       69.21     (11.5%)       66.53     (15.2%)   -3.9% ( -27% -   25%)
                 LowTerm     1338.88      (9.5%)     1288.80      (9.8%)   -3.7% ( -21% -   17%)
              AndHighMed      104.48      (5.9%)      100.86     (11.8%)   -3.5% ( -19% -   15%)
            OrHighNotLow      200.80     (14.6%)      193.85     (17.7%)   -3.5% ( -31% -   33%)
         MedSloppyPhrase        5.44     (12.0%)        5.25     (15.6%)   -3.4% ( -27% -   27%)
                HighTerm      154.05     (23.3%)      148.81     (18.5%)   -3.4% ( -36% -   49%)
            OrHighNotMed      181.16     (15.7%)      175.41     (16.4%)   -3.2% ( -30% -   34%)
              OrHighHigh      141.36     (12.1%)      137.07     (14.7%)   -3.0% ( -26% -   27%)
                  Fuzzy2      135.77     (14.3%)      131.98     (12.0%)   -2.8% ( -25% -   27%)
                 Prefix3       34.23     (15.5%)       33.28     (20.7%)   -2.8% ( -33% -   39%)
                Wildcard      129.89     (14.4%)      126.29     (17.7%)   -2.8% ( -30% -   34%)
               MedPhrase        7.74     (15.8%)        7.54     (17.2%)   -2.5% ( -30% -   36%)
            OrNotHighMed       58.84     (16.3%)       57.47     (13.0%)   -2.3% ( -27% -   32%)
                 MedTerm      350.57     (18.1%)      344.41     (14.9%)   -1.8% ( -29% -   38%)
                  Fuzzy1      101.74     (13.1%)      100.00     (14.5%)   -1.7% ( -25% -   29%)
           OrHighNotHigh       44.20     (12.9%)       43.45     (13.9%)   -1.7% ( -25% -   28%)
        HighSloppyPhrase       16.81     (16.2%)       16.63     (16.3%)   -1.1% ( -28% -   37%)
               OrHighLow      135.97     (17.9%)      134.71     (17.0%)   -0.9% ( -30% -   41%)
              HighPhrase       23.69     (11.9%)       23.59     (13.9%)   -0.4% ( -23% -   28%)
                  IntNRQ       24.97     (18.5%)       24.88     (21.4%)   -0.4% ( -33% -   48%)
         LowSloppyPhrase     5863.22     (12.3%)     5867.79     (12.0%)    0.1% ( -21% -   27%)
            OrNotHighLow      183.47     (14.1%)      184.00     (11.9%)    0.3% ( -22% -   30%)
             MedSpanNear        4.94     (11.3%)        4.97      (9.5%)    0.4% ( -18% -   23%)
               OrHighMed       99.96     (15.6%)      100.52     (15.5%)    0.6% ( -26% -   37%)
             LowSpanNear       40.66     (11.2%)       41.05     (13.0%)    1.0% ( -20% -   28%)
               LowPhrase      135.03     (12.9%)      136.67     (16.5%)    1.2% ( -25% -   35%)
            HighSpanNear       14.62     (12.5%)       14.85     (11.7%)    1.6% ( -20% -   29%)
              AndHighLow      587.30     (15.4%)      602.79     (12.0%)    2.6% ( -21% -   35%)
{noformat}
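
For reading the tables: the Pct diff figures are consistent with the relative change in QPS from base to comp, and every difference in both tables is smaller than the StdDev values reported beside it. A minimal sketch of that arithmetic, assuming this definition (it is inferred from the numbers above, not taken from the benchmark tool's source; the class and method names are made up for illustration):

```java
/** Hypothetical helper for re-deriving the "Pct diff" column; not part of the benchmark tool. */
final class PctDiff {
    /** Relative change from base QPS to comp QPS, in percent. */
    static double pctDiff(double qpsBase, double qpsComp) {
        return 100.0 * (qpsComp - qpsBase) / qpsBase;
    }

    public static void main(String[] args) {
        // OrHighNotMed on wikimedium: base 35.53, comp 33.51
        System.out.printf("%.1f%%%n", pctDiff(35.53, 33.51));
        // MedPhrase on wikimedium: base 187.91, comp 180.25
        System.out.printf("%.1f%%%n", pctDiff(187.91, 180.25));
    }
}
```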

> Simplify ExactPhraseScorer
> --------------------------
>
>                 Key: LUCENE-5809
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5809
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: core/search
>            Reporter: Robert Muir
>         Attachments: LUCENE-5809.patch
>
>
> While looking at this scorer I see a few little things which are remnants of
> the past:
> * crazy heuristics to use next() over advance(): I think it should just use
> advance(), like ConjunctionScorer. These days advance() isn't stupid anymore.
> * incorrect leapfrogging: the lead scorer is never advanced if a subsequent
> scorer goes past it; it just falls into this nextDoc() loop.
> * pre-next()'ing: we are using the cost() API to sort, so there is no need to
> do that.
> * UnionDocsAndPositionsEnum doesn't follow the DocsEnum contract and set the
> initial doc to -1.
> * postings reader advance() doesn't need to check docFreq > BLOCK_SIZE on each
> advance call; that's easy to remove.
> So I think really this scorer should just look like "ConjunctionScorer that
> verifies positions on match".
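
The shape described in the issue ("conjunction scorer that verifies positions on match") could be sketched roughly as below. This is an illustration only, not the attached patch and not Lucene's API: ToyPostings, ExactPhraseSketch, and all of their methods are hypothetical stand-ins, and advance() is a linear scan for brevity. The sketch sorts the terms by cost, leapfrogs with advance() so that the lead is moved forward whenever another iterator goes past it, and checks positions only once all iterators sit on the same doc.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical, simplified postings iterator for one phrase term (not Lucene's API). */
final class ToyPostings {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int offset;        // index of this term within the phrase
    private final int[] docs;        // sorted doc IDs containing the term
    private final int[][] positions; // positions[i]: sorted positions of the term in docs[i]
    private int idx = -1;            // unpositioned, so docID() starts at -1

    ToyPostings(int offset, int[] docs, int[][] positions) {
        this.offset = offset;
        this.docs = docs;
        this.positions = positions;
    }

    int offset() { return offset; }
    long cost() { return docs.length; }
    int docID() { return idx < 0 ? -1 : idx >= docs.length ? NO_MORE_DOCS : docs[idx]; }
    int[] currentPositions() { return positions[idx]; }

    /** Moves to the first doc after the current one whose ID is >= target (linear scan). */
    int advance(int target) {
        do { idx++; } while (idx < docs.length && docs[idx] < target);
        return docID();
    }
}

final class ExactPhraseSketch {
    /** Returns the IDs of docs in which the terms occur at consecutive positions. */
    static List<Integer> matchingDocs(ToyPostings[] terms) {
        ToyPostings[] sorted = terms.clone();
        Arrays.sort(sorted, Comparator.comparingLong(ToyPostings::cost)); // cheapest term leads
        ToyPostings lead = sorted[0];
        List<Integer> hits = new ArrayList<>();
        int doc = lead.advance(0);
        while (doc != ToyPostings.NO_MORE_DOCS) {
            int beyond = -1; // doc ID of the first iterator that went past the lead, if any
            for (int i = 1; i < sorted.length && beyond < 0; i++) {
                ToyPostings other = sorted[i];
                int d = other.docID() < doc ? other.advance(doc) : other.docID();
                if (d > doc) beyond = d;
            }
            if (beyond >= 0) {
                doc = lead.advance(beyond);  // leapfrog: the lead catches up
            } else {
                if (positionsMatch(sorted)) hits.add(doc); // verify positions only on a doc match
                doc = lead.advance(doc + 1);
            }
        }
        return hits;
    }

    /** True if some phrase start position lines up for every term in the current doc. */
    private static boolean positionsMatch(ToyPostings[] terms) {
        ToyPostings first = terms[0];
        for (int p : first.currentPositions()) {
            int start = p - first.offset(); // candidate position of the phrase's first term
            boolean all = true;
            for (int i = 1; i < terms.length && all; i++) {
                int wanted = start + terms[i].offset();
                all = Arrays.binarySearch(terms[i].currentPositions(), wanted) >= 0;
            }
            if (all) return true;
        }
        return false;
    }
}
```

With a two-term phrase where the first term occurs in docs 1, 3, 7 and the second in docs 3, 5, 7, 9, only docs 3 and 7 reach the position check, and a doc is reported only if the second term sits exactly one position after the first.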



