[jira] [Updated] (LUCENE-6919) Change the Scorer API to expose an iterator instead of extending DocIdSetIterator

Adrien Grand (JIRA) Mon, 07 Dec 2015 10:15:02 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Adrien Grand updated LUCENE-6919:
---------------------------------
    Attachment: LUCENE-6919.patch

Good point about Scorer.docId(). I think it's also better not to require 
Collectors to go through the iterator, since they are not supposed to advance 
it. Once this change is in, maybe in the future we can expose a smaller 
interface in Collector.setScorer (one that only exposes a doc ID, a freq and a 
score, as suggested in LUCENE-6228).
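
One possible shape for such a narrower interface, as a rough sketch only. The names `ScoreView` and `HitCollector` are hypothetical, invented for illustration; they are not part of Lucene or of the attached patch. The point is that a collector holding this view can read the current hit but has no method for moving the iterator.

```java
final class ScoreViewSketch {
    // Hypothetical read-only view of the current hit (assumed shape, not a
    // Lucene type): doc ID, freq and score, and nothing that can iterate.
    interface ScoreView {
        int docID();
        int freq();
        float score();
    }

    // Hypothetical collector that only ever sees the narrow view.
    static final class HitCollector {
        private ScoreView view;
        private float bestScore = Float.NEGATIVE_INFINITY;
        private int bestDoc = -1;

        void setScorer(ScoreView view) { this.view = view; }

        // Called once per matching document by whoever drives the iteration.
        void collect() {
            float s = view.score();
            if (s > bestScore) {
                bestScore = s;
                bestDoc = view.docID();
            }
        }

        int bestDoc() { return bestDoc; }
    }

    // Drives a collector over three made-up hits and returns the best doc ID.
    static int run() {
        final int[] docs = {2, 5, 7};
        final float[] scores = {0.3f, 1.2f, 0.9f};
        final int[] pos = {0}; // cursor owned by the driver, not the collector
        ScoreView view = new ScoreView() {
            public int docID() { return docs[pos[0]]; }
            public int freq() { return 1; }
            public float score() { return scores[pos[0]]; }
        };
        HitCollector collector = new HitCollector();
        collector.setScorer(view);
        for (pos[0] = 0; pos[0] < docs.length; pos[0]++) {
            collector.collect();
        }
        return collector.bestDoc();
    }

    public static void main(String[] args) {
        System.out.println("best doc: " + run());
    }
}
```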

Here is a patch with the proposed changes. It makes some things slightly more 
complicated due to the additional level of indirection between a Weight and a 
DocIdSetIterator (e.g. for delete-by-query), but it also makes some Scorer 
implementations simpler, since they can now just return the iterator instead of 
reimplementing the whole DISI API and forwarding it to an existing 
DocIdSetIterator (for instance, this is what conjunctions did).
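
The difference between forwarding and returning can be sketched as follows. This is a minimal illustration and not code from the patch: `DocIdSetIterator` here is a simplified stand-in declared inside the sketch, and `ArrayIterator`, `OldStyleScorer` and `NewStyleScorer` are hypothetical names.

```java
final class ForwardingSketch {
    // Simplified stand-in for Lucene's DocIdSetIterator (not the real class).
    abstract static class DocIdSetIterator {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        abstract int docID();
        abstract int nextDoc();
        abstract int advance(int target);
        abstract long cost();
    }

    // Iterator over a sorted array of doc IDs, playing the "existing" iterator.
    static final class ArrayIterator extends DocIdSetIterator {
        private final int[] docs;
        private int idx = -1;
        ArrayIterator(int... docs) { this.docs = docs; }
        int docID() {
            if (idx < 0) return -1;
            return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
        }
        int nextDoc() {
            if (idx < docs.length) idx++;
            return docID();
        }
        int advance(int target) {
            int doc;
            do { doc = nextDoc(); } while (doc < target);
            return doc;
        }
        long cost() { return docs.length; }
    }

    // Old style: the scorer IS an iterator, so wrapping an existing iterator
    // means forwarding every iterator method by hand.
    static final class OldStyleScorer extends DocIdSetIterator {
        private final DocIdSetIterator in;
        OldStyleScorer(DocIdSetIterator in) { this.in = in; }
        int docID() { return in.docID(); }
        int nextDoc() { return in.nextDoc(); }
        int advance(int target) { return in.advance(target); }
        long cost() { return in.cost(); }
        float score() { return 1f; }
    }

    // New style: the scorer HAS an iterator and simply hands it out.
    static final class NewStyleScorer {
        private final DocIdSetIterator in;
        NewStyleScorer(DocIdSetIterator in) { this.in = in; }
        DocIdSetIterator iterator() { return in; }
        int docID() { return in.docID(); }
        float score() { return 1f; }
    }

    // Counts hits by calling through the forwarding scorer.
    static int countOld(int... docs) {
        OldStyleScorer scorer = new OldStyleScorer(new ArrayIterator(docs));
        int n = 0;
        while (scorer.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) n++;
        return n;
    }

    // Counts hits by iterating the returned iterator directly.
    static int countNew(int... docs) {
        DocIdSetIterator it = new NewStyleScorer(new ArrayIterator(docs)).iterator();
        int n = 0;
        while (it.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countOld(1, 4, 9) + " " + countNew(1, 4, 9));
    }
}
```

In the second variant the caller iterates the underlying iterator directly, so the scorer adds no extra call per document.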

For consistency, Scorer.asTwoPhaseIterator has been renamed to 
Scorer.twoPhaseIterator. So Scorer now has two methods: Scorer.iterator(), 
which is required and returns a DocIdSetIterator, and 
Scorer.twoPhaseIterator(), which is optional and returns a TwoPhaseIterator.
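
To make the required/optional split concrete, here is a small self-contained sketch. The types below are simplified stand-ins declared inside the sketch, not the real Lucene classes, and `EvenDocScorer`, `over`, `plain` and `collect` are hypothetical helpers. It assumes that "optional" means returning null when two-phase iteration is not supported.

```java
import java.util.ArrayList;
import java.util.List;

final class TwoPhaseSketch {
    // Simplified stand-ins for the Lucene types (not the real classes).
    abstract static class DocIdSetIterator {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        abstract int docID();
        abstract int nextDoc();
    }

    abstract static class TwoPhaseIterator {
        final DocIdSetIterator approximation;
        TwoPhaseIterator(DocIdSetIterator approximation) { this.approximation = approximation; }
        // Confirms whether the approximation's current doc really matches.
        abstract boolean matches();
    }

    abstract static class Scorer {
        abstract DocIdSetIterator iterator();                 // required
        TwoPhaseIterator twoPhaseIterator() { return null; }  // optional (assumed: null = unsupported)
    }

    // Iterator over a sorted array of doc IDs.
    static DocIdSetIterator over(final int... docs) {
        return new DocIdSetIterator() {
            private int idx = -1;
            int docID() {
                if (idx < 0) return -1;
                return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
            }
            int nextDoc() {
                if (idx < docs.length) idx++;
                return docID();
            }
        };
    }

    // Scorer without two-phase support: only iterator() is implemented.
    static Scorer plain(final int... docs) {
        final DocIdSetIterator it = over(docs);
        return new Scorer() {
            DocIdSetIterator iterator() { return it; }
        };
    }

    // Hypothetical two-phase scorer: candidates come from a cheap approximation
    // and a (pretend) costly check keeps only even doc IDs.
    static final class EvenDocScorer extends Scorer {
        private final TwoPhaseIterator twoPhase;
        private final DocIdSetIterator exact;

        EvenDocScorer(int... candidates) {
            final DocIdSetIterator approx = over(candidates);
            final TwoPhaseIterator tp = new TwoPhaseIterator(approx) {
                boolean matches() { return approx.docID() % 2 == 0; }
            };
            this.twoPhase = tp;
            // Exact iterator: skips candidates that fail the confirmation step.
            this.exact = new DocIdSetIterator() {
                int docID() { return approx.docID(); }
                int nextDoc() {
                    int doc = approx.nextDoc();
                    while (doc != NO_MORE_DOCS && !tp.matches()) doc = approx.nextDoc();
                    return doc;
                }
            };
        }

        DocIdSetIterator iterator() { return exact; }
        TwoPhaseIterator twoPhaseIterator() { return twoPhase; }
    }

    // The caller decides once, up front, which iteration style to use.
    static List<Integer> collect(Scorer scorer) {
        List<Integer> hits = new ArrayList<>();
        TwoPhaseIterator tp = scorer.twoPhaseIterator();
        if (tp == null) {
            DocIdSetIterator it = scorer.iterator();
            for (int d = it.nextDoc(); d != DocIdSetIterator.NO_MORE_DOCS; d = it.nextDoc()) {
                hits.add(d);
            }
        } else {
            DocIdSetIterator approx = tp.approximation;
            for (int d = approx.nextDoc(); d != DocIdSetIterator.NO_MORE_DOCS; d = approx.nextDoc()) {
                if (tp.matches()) hits.add(d);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(collect(new EvenDocScorer(1, 2, 4, 7, 10)));
        System.out.println(collect(plain(3, 8)));
    }
}
```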

The luceneutil report is consistent with what I got with the hacky patch:
{noformat}
                    Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
            OrNotHighLow     1119.21      (3.9%)     1078.59      (5.3%)   -3.6% ( -12% -    5%)
              AndHighLow      796.76      (4.9%)      774.43      (5.6%)   -2.8% ( -12% -    8%)
         MedSloppyPhrase       58.28      (4.0%)       57.67      (4.9%)   -1.0% (  -9% -    8%)
             LowSpanNear      122.71      (2.0%)      121.65      (2.8%)   -0.9% (  -5% -    4%)
            HighSpanNear        3.25      (2.3%)        3.22      (2.0%)   -0.8% (  -4% -    3%)
               LowPhrase      113.56      (1.5%)      112.73      (2.1%)   -0.7% (  -4% -    2%)
             MedSpanNear      109.63      (2.4%)      109.15      (1.7%)   -0.4% (  -4% -    3%)
        HighSloppyPhrase        6.23      (5.0%)        6.22      (5.4%)   -0.2% ( -10% -   10%)
            OrHighNotLow      102.42      (3.8%)      102.32      (3.5%)   -0.1% (  -7% -    7%)
         LowSloppyPhrase       22.01      (2.3%)       22.01      (2.8%)    0.0% (  -4% -    5%)
               MedPhrase       15.92      (1.6%)       15.94      (1.8%)    0.1% (  -3% -    3%)
              HighPhrase       34.63      (3.1%)       34.75      (3.2%)    0.3% (  -5% -    6%)
            OrNotHighMed      141.69      (3.3%)      142.75      (3.3%)    0.7% (  -5% -    7%)
           OrNotHighHigh       50.74      (2.1%)       51.15      (2.7%)    0.8% (  -3% -    5%)
                 Respell       63.24      (3.2%)       64.05      (3.1%)    1.3% (  -4% -    7%)
           OrHighNotHigh       42.37      (2.8%)       42.92      (3.0%)    1.3% (  -4% -    7%)
            OrHighNotMed       80.74      (2.9%)       82.18      (2.9%)    1.8% (  -3% -    7%)
                 Prefix3      151.13      (4.7%)      155.37      (6.7%)    2.8% (  -8% -   14%)
             AndHighHigh       36.96      (2.3%)       38.37      (2.3%)    3.8% (   0% -    8%)
                  Fuzzy1       25.95      (5.9%)       27.00      (5.7%)    4.0% (  -7% -   16%)
               OrHighMed       50.05      (5.0%)       52.10      (5.7%)    4.1% (  -6% -   15%)
              OrHighHigh       33.64      (5.2%)       35.16      (4.7%)    4.5% (  -5% -   15%)
                  IntNRQ       10.93      (6.9%)       11.46      (6.2%)    4.8% (  -7% -   19%)
                 MedTerm      179.51      (3.8%)      188.22      (3.9%)    4.9% (  -2% -   13%)
               OrHighLow       79.55      (2.9%)       83.56      (2.8%)    5.0% (   0% -   11%)
                 LowTerm      682.13      (8.0%)      716.84      (6.4%)    5.1% (  -8% -   21%)
              AndHighMed      114.21      (2.4%)      120.06      (2.4%)    5.1% (   0% -   10%)
                Wildcard       29.31      (6.4%)       31.07      (5.8%)    6.0% (  -5% -   19%)
                HighTerm      118.05      (3.5%)      125.83      (4.4%)    6.6% (  -1% -   14%)
                  Fuzzy2       61.23     (20.9%)       67.19     (21.3%)    9.7% ( -26% -   65%)
{noformat}

> Change the Scorer API to expose an iterator instead of extending 
> DocIdSetIterator
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-6919
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6919
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-6919.patch, LUCENE-6919.patch
>
>
> I was working on trying to address the performance regression on LUCENE-6815 
> but this is hard to do without introducing specialization of 
> DisjunctionScorer which I'd like to avoid at all costs.
> I think the performance regression would be easy to address without 
> specialization if Scorers were changed to return an iterator instead of 
> extending DocIdSetIterator. So conceptually the API would move from
> {code}
> class Scorer extends DocIdSetIterator {
> }
> {code}
> to
> {code}
> class Scorer {
>   DocIdSetIterator iterator();
> }
> {code}
> This would help me because then if none of the sub clauses support two-phase 
> iteration, DisjunctionScorer could directly return the approximation as an 
> iterator instead of having to check if twoPhase == null at every iteration.
> Such an approach could also help remove some method calls. For instance 
> TermScorer.nextDoc calls PostingsEnum.nextDoc but with this change 
> TermScorer.iterator() could return the PostingsEnum and TermScorer would not 
> even appear in stack traces when scoring. I hacked a patch to see how much 
> that would help and luceneutil seems to like the change:
> {noformat}
>                     Task QPS baseline      StdDev   QPS patch      StdDev                Pct diff
>                   Fuzzy1       88.54     (15.7%)       86.73     (16.6%)   -2.0% ( -29% -   35%)
>               AndHighLow      698.98      (4.1%)      691.11      (5.1%)   -1.1% (  -9% -    8%)
>                   Fuzzy2       26.47     (11.2%)       26.28     (10.3%)   -0.7% ( -19% -   23%)
>              MedSpanNear      141.03      (3.3%)      140.51      (3.2%)   -0.4% (  -6% -    6%)
>               HighPhrase       60.66      (2.6%)       60.48      (3.3%)   -0.3% (  -5% -    5%)
>              LowSpanNear       29.25      (2.4%)       29.21      (2.1%)   -0.1% (  -4% -    4%)
>                MedPhrase       28.32      (1.9%)       28.28      (2.0%)   -0.1% (  -3% -    3%)
>                LowPhrase       17.31      (2.1%)       17.29      (2.6%)   -0.1% (  -4% -    4%)
>         HighSloppyPhrase       10.93      (6.0%)       10.92      (6.0%)   -0.1% ( -11% -   12%)
>          MedSloppyPhrase       72.21      (2.2%)       72.27      (1.8%)    0.1% (  -3% -    4%)
>                  Respell       57.35      (3.2%)       57.41      (3.4%)    0.1% (  -6% -    6%)
>             HighSpanNear       26.71      (3.0%)       26.75      (2.5%)    0.1% (  -5% -    5%)
>             OrNotHighLow      803.46      (3.4%)      807.03      (4.2%)    0.4% (  -6% -    8%)
>          LowSloppyPhrase       88.02      (3.4%)       88.77      (2.5%)    0.8% (  -4% -    7%)
>             OrNotHighMed      200.45      (2.7%)      203.83      (2.5%)    1.7% (  -3% -    7%)
>               OrHighHigh       38.98      (7.9%)       40.30      (6.6%)    3.4% ( -10% -   19%)
>                 HighTerm       92.53      (5.3%)       95.94      (5.8%)    3.7% (  -7% -   15%)
>                OrHighMed       53.80      (7.7%)       55.79      (6.6%)    3.7% (  -9% -   19%)
>               AndHighMed      266.69      (1.7%)      277.15      (2.5%)    3.9% (   0% -    8%)
>                  Prefix3       44.68      (5.4%)       46.60      (7.0%)    4.3% (  -7% -   17%)
>                  MedTerm      261.52      (4.9%)      273.52      (5.4%)    4.6% (  -5% -   15%)
>                 Wildcard       42.39      (6.1%)       44.35      (7.8%)    4.6% (  -8% -   19%)
>                   IntNRQ       10.46      (7.0%)       10.99      (9.5%)    5.0% ( -10% -   23%)
>            OrNotHighHigh       67.15      (4.6%)       70.65      (4.5%)    5.2% (  -3% -   15%)
>            OrHighNotHigh       43.07      (5.1%)       45.36      (5.4%)    5.3% (  -4% -   16%)
>                OrHighLow       64.19      (6.4%)       67.72      (5.5%)    5.5% (  -6% -   18%)
>              AndHighHigh       64.17      (2.3%)       67.87      (2.1%)    5.8% (   1% -   10%)
>                  LowTerm      642.94     (10.9%)      681.48      (8.5%)    6.0% ( -12% -   28%)
>             OrHighNotMed       12.68      (6.9%)       13.51      (6.6%)    6.5% (  -6% -   21%)
>             OrHighNotLow       54.69      (6.8%)       58.25      (7.0%)    6.5% (  -6% -   21%)
> {noformat}



