[ 
https://issues.apache.org/jira/browse/LUCENE-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599016#comment-14599016
 ] 

Adrien Grand commented on LUCENE-6553:
--------------------------------------

I ran luceneutil on wikimedium10M with deleted documents:
{noformat}
            Task  QPS baseline    StdDev   QPS patch    StdDev  Pct diff
          Fuzzy1         91.66    (7.4%)       89.66    (7.3%)   -2.2% ( -15% -   13%)
    OrHighNotLow         62.10    (3.1%)       61.14    (4.8%)   -1.5% (  -9% -    6%)
     LowSpanNear         27.30    (3.3%)       26.88    (4.6%)   -1.5% (  -9% -    6%)
    OrHighNotMed         38.74    (2.8%)       38.31    (4.6%)   -1.1% (  -8% -    6%)
HighSloppyPhrase          3.23    (4.7%)        3.20    (4.6%)   -1.0% (  -9% -    8%)
 MedSloppyPhrase         54.01    (2.6%)       53.55    (2.2%)   -0.9% (  -5% -    4%)
 LowSloppyPhrase         41.78    (2.3%)       41.50    (2.5%)   -0.7% (  -5% -    4%)
       LowPhrase         15.75    (1.2%)       15.68    (2.1%)   -0.4% (  -3% -    2%)
       MedPhrase         14.62    (1.5%)       14.58    (2.4%)   -0.3% (  -4% -    3%)
      HighPhrase         20.86    (3.1%)       20.86    (4.0%)   -0.0% (  -6% -    7%)
         Respell         94.55    (4.8%)       94.58    (4.5%)    0.0% (  -8% -    9%)
        Wildcard         60.39    (4.4%)       60.49    (3.9%)    0.2% (  -7% -    8%)
   OrHighNotHigh         33.38    (1.7%)       33.52    (3.3%)    0.4% (  -4% -    5%)
    HighSpanNear          8.55    (2.4%)        8.61    (3.0%)    0.8% (  -4% -    6%)
    OrNotHighMed        211.67    (1.6%)      214.27    (2.3%)    1.2% (  -2% -    5%)
   OrNotHighHigh         63.32    (1.6%)       64.12    (3.1%)    1.3% (  -3% -    6%)
    OrNotHighLow       1031.92    (3.7%)     1045.97    (4.7%)    1.4% (  -6% -   10%)
        HighTerm        141.43    (3.7%)      143.85    (4.9%)    1.7% (  -6% -   10%)
     MedSpanNear         34.47    (2.0%)       35.07    (2.1%)    1.7% (  -2% -    5%)
         MedTerm        208.01    (3.6%)      211.76    (4.8%)    1.8% (  -6% -   10%)
      AndHighLow        819.26    (5.5%)      842.89    (5.5%)    2.9% (  -7% -   14%)
      AndHighMed        203.53    (2.2%)      210.03    (2.1%)    3.2% (  -1% -    7%)
          IntNRQ         14.08    (8.1%)       14.59   (10.1%)    3.6% ( -13% -   23%)
         Prefix3         41.82    (7.6%)       43.52    (8.8%)    4.1% ( -11% -   22%)
     AndHighHigh         47.54    (1.9%)       49.68    (2.2%)    4.5% (   0% -    8%)
       OrHighMed         71.76    (5.4%)       76.11    (4.9%)    6.1% (  -4% -   17%)
         LowTerm        654.52    (9.3%)      695.50   (10.3%)    6.3% ( -12% -   28%)
       OrHighLow         67.44    (5.4%)       72.46    (5.0%)    7.4% (  -2% -   18%)
      OrHighHigh         26.92    (5.8%)       28.95    (5.4%)    7.5% (  -3% -   19%)
          Fuzzy2         81.71   (22.1%)       96.27   (18.1%)   17.8% ( -18% -   74%)
{noformat}

For most queries performance is similar, but disjunctions look like they got a 
slight performance boost with this patch (OrHighMed, OrHighLow and OrHighHigh 
are up by roughly 6% to 7.5%).
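
For reference, the "Pct diff" figures for the disjunction rows can be reproduced 
from the two QPS columns. The snippet below is only a sketch: it assumes 
"Pct diff" is the relative change in mean QPS, which matches the table values 
but is not taken from the luceneutil source. The class and method names are 
made up for illustration.

```java
public class PctDiffSketch {

    // Assumption: "Pct diff" is the relative change of the patch QPS
    // with respect to the baseline QPS, expressed in percent.
    static double pctDiff(double baselineQps, double patchQps) {
        return 100.0 * (patchQps - baselineQps) / baselineQps;
    }

    public static void main(String[] args) {
        // QPS values copied from the disjunction rows of the table above.
        System.out.printf("OrHighMed  %.1f%%%n", pctDiff(71.76, 76.11));
        System.out.printf("OrHighLow  %.1f%%%n", pctDiff(67.44, 72.46));
        System.out.printf("OrHighHigh %.1f%%%n", pctDiff(26.92, 28.95));
    }
}
```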

> Simplify how we handle deleted docs in read APIs
> ------------------------------------------------
>
>                 Key: LUCENE-6553
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6553
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: Trunk
>
>         Attachments: LUCENE-6553.patch
>
>
> Today, all scorers and postings formats need to be able to handle deleted 
> documents.
> I suspect that the reason is that we want to avoid performing costly 
> operations on deleted documents. For instance, if you run a phrase query, 
> reading positions on a deleted document is useless. I suspect this is also 
> a source of inefficiencies, since in some cases we apply deleted documents 
> several times: for instance, conjunctions apply deleted docs to every sub 
> scorer.
> However, with the new two-phase iteration API, we have a way to make sure 
> that we never run expensive operations on deleted documents: we could first 
> iterate over the approximation, then check that the document is not deleted, 
> and finally confirm the match. Since approximations are cheap, applying 
> deleted docs after them would not be an issue.
> I would like to explore removing the "Bits acceptDocs" parameter from 
> TermsEnum.postings, Weight.scorer, SpanWeight.getSpans and Weight.bulkScorer, 
> and adding it to BulkScorer.score. This way, bulk scorers would be the only 
> API that needs to know how to apply deleted docs, which I think would be 
> more manageable since we only have 3 or 4 implementations. DefaultBulkScorer 
> would be implemented the way described above: first advance the 
> approximation, then check deleted docs, then confirm the match, then 
> collect. Of course, that only applies when the scorer supports 
> approximations; if it does not, the scorer is cheap, so we can iterate it 
> directly and check deleted docs on top.
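
The loop described in the issue (advance the approximation, check deleted docs, 
confirm the match, collect) could look roughly like the sketch below. It does 
not use the real Lucene classes: Approximation, fromSorted and score are 
hypothetical stand-ins, and IntPredicate plays the role of both the live-docs 
bits and the expensive match confirmation. It is only meant to show the order 
of the checks, not the actual patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class TwoPhaseLiveDocsSketch {

    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a cheap approximation iterator. */
    interface Approximation {
        int nextDoc(); // returns NO_MORE_DOCS once exhausted
    }

    /** Builds an approximation over a sorted list of candidate doc IDs. */
    static Approximation fromSorted(int... docs) {
        int[] pos = {0};
        return () -> pos[0] < docs.length ? docs[pos[0]++] : NO_MORE_DOCS;
    }

    /**
     * Collects matching live documents. liveDocs may be null, meaning that
     * no document is deleted. matches is the costly confirmation step
     * (for example, reading positions for a phrase query).
     */
    static List<Integer> score(Approximation approximation,
                               IntPredicate liveDocs,
                               IntPredicate matches) {
        List<Integer> collected = new ArrayList<>();
        for (int doc = approximation.nextDoc();
             doc != NO_MORE_DOCS;
             doc = approximation.nextDoc()) {
            // 1. The approximation has already advanced to a candidate.
            // 2. Skip deleted documents before doing any expensive work.
            if (liveDocs != null && !liveDocs.test(doc)) {
                continue;
            }
            // 3. Confirm the match (the expensive part).
            if (!matches.test(doc)) {
                continue;
            }
            // 4. Collect.
            collected.add(doc);
        }
        return collected;
    }

    public static void main(String[] args) {
        int[] confirmCalls = {0};
        IntPredicate liveDocs = doc -> doc != 3 && doc != 9; // docs 3 and 9 are deleted
        IntPredicate matches = doc -> {
            confirmCalls[0]++;
            return doc % 2 == 1; // pretend only odd docs truly match
        };
        List<Integer> hits = score(fromSorted(1, 3, 4, 7, 9), liveDocs, matches);
        System.out.println(hits + " with " + confirmCalls[0] + " confirmations");
    }
}
```

In this toy run the confirmation step is invoked only for the three live 
candidates, never for the two deleted ones, which is the property the issue is 
after.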


