[jira] [Commented] (LUCENE-6553) Simplify how we handle deleted docs in read APIs

Adrien Grand (JIRA) Wed, 24 Jun 2015 01:14:03 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599055#comment-14599055
 ]


Adrien Grand commented on LUCENE-6553:
--------------------------------------

Luceneutil on wikimedium10M again, but without deleted documents this time:

{code}
                  IntNRQ        9.57      (5.8%)        9.31      (6.6%)   -2.7% ( -14% -   10%)
                 Prefix3      253.58      (3.5%)      249.27      (3.4%)   -1.7% (  -8% -    5%)
                 LowTerm      695.13      (2.9%)      685.91      (2.9%)   -1.3% (  -6% -    4%)
                Wildcard       51.13      (3.6%)       50.49      (4.3%)   -1.3% (  -8% -    6%)
         LowSloppyPhrase       13.87      (5.3%)       13.71      (5.4%)   -1.1% ( -11% -   10%)
               MedPhrase       99.70      (3.2%)       98.69      (4.3%)   -1.0% (  -8% -    6%)
                  Fuzzy1       86.60     (11.0%)       85.75     (11.0%)   -1.0% ( -20% -   23%)
                 Respell      103.93      (3.3%)      103.18      (3.5%)   -0.7% (  -7% -    6%)
        HighSloppyPhrase        8.18      (5.6%)        8.13      (5.9%)   -0.7% ( -11% -   11%)
               OrHighLow       55.24      (6.4%)       54.90      (6.9%)   -0.6% ( -13% -   13%)
              HighPhrase        8.42      (5.9%)        8.37      (6.4%)   -0.6% ( -12% -   12%)
               OrHighMed       19.64      (6.4%)       19.52      (7.2%)   -0.6% ( -13% -   13%)
               LowPhrase       58.69      (2.2%)       58.34      (2.4%)   -0.6% (  -5% -    4%)
         MedSloppyPhrase       43.44      (5.4%)       43.21      (5.3%)   -0.5% ( -10% -   10%)
              OrHighHigh       39.31      (6.5%)       39.14      (6.9%)   -0.4% ( -12% -   13%)
              AndHighLow      690.71      (5.1%)      688.77      (4.3%)   -0.3% (  -9% -    9%)
            OrNotHighMed      153.25      (1.8%)      152.97      (1.9%)   -0.2% (  -3% -    3%)
             AndHighHigh       65.10      (2.6%)       65.08      (3.2%)   -0.0% (  -5% -    5%)
           OrNotHighHigh       46.47      (1.4%)       46.47      (1.9%)   -0.0% (  -3% -    3%)
              AndHighMed      168.75      (2.3%)      168.79      (2.2%)    0.0% (  -4% -    4%)
             MedSpanNear       61.15      (3.9%)       61.41      (3.5%)    0.4% (  -6% -    8%)
            OrNotHighLow     1137.12      (4.0%)     1142.11      (3.5%)    0.4% (  -6% -    8%)
           OrHighNotHigh       54.49      (1.7%)       54.74      (1.9%)    0.5% (  -3% -    4%)
             LowSpanNear       14.95      (2.8%)       15.02      (2.9%)    0.5% (  -5% -    6%)
            OrHighNotMed       41.44      (2.5%)       41.73      (2.6%)    0.7% (  -4% -    5%)
                 MedTerm      289.16      (3.5%)      292.24      (2.9%)    1.1% (  -5% -    7%)
            OrHighNotLow       87.80      (3.3%)       88.86      (3.1%)    1.2% (  -5% -    7%)
                HighTerm       81.86      (3.9%)       83.56      (3.5%)    2.1% (  -5% -    9%)
            HighSpanNear       42.21      (3.5%)       43.33      (4.2%)    2.6% (  -4% -   10%)
                  Fuzzy2       58.86     (15.6%)       60.45      (9.4%)    2.7% ( -19% -   32%)
{code}

All of these differences look like noise to me: every reported range spans zero.

> Simplify how we handle deleted docs in read APIs
> ------------------------------------------------
>
>                 Key: LUCENE-6553
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6553
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: Trunk
>
>         Attachments: LUCENE-6553.patch
>
>
> Today, all scorers and postings formats need to be able to handle deleted
> documents.
> I suspect the reason is that we want to avoid performing costly operations
> on deleted documents. For instance, if you run a phrase query, reading
> positions on a deleted document is useless. I suspect this is also a source
> of inefficiency, since in some cases we apply deleted documents several
> times: for instance, conjunctions apply deleted docs to every sub-scorer.
> However, with the new two-phase iteration API, we have a way to make sure
> that we never run expensive operations on deleted documents: we could first
> iterate over the approximation, then check that the document is not deleted,
> and finally confirm the match. Since approximations are cheap, applying
> deleted docs after them would not be an issue.
> I would like to explore removing the "Bits acceptDocs" parameter from
> TermsEnum.postings, Weight.scorer, SpanWeight.getSpans and Weight.bulkScorer,
> and adding it to BulkScorer.score. This way, bulk scorers would be the only
> API that needs to know how to apply deleted docs, which I think would be
> more manageable since we only have 3 or 4 impls. DefaultBulkScorer would
> be implemented the way described above: first advance the approximation, then
> check deleted docs, then confirm the match, then collect. Of course, that
> only applies when the scorer supports approximations; if it does not, it
> means it is cheap, so we can iterate the scorer directly and check deleted
> docs on top.
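
For illustration, the ordering described above (advance the approximation, check deleted docs, confirm the match, collect) could be sketched roughly as follows. This is a minimal, self-contained sketch and not the attached patch: `TwoPhase`, `ToyTwoPhase`, `score` and the `BitSet`-based `liveDocs` are hypothetical stand-ins rather than Lucene classes, and the toy "even doc IDs match" rule only exists so that the costly confirmation step can be counted.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative only: every type here is a hypothetical stand-in, not a Lucene class.
class TwoPhaseDeletesSketch {

    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Cheap candidate iteration plus a costly per-document confirmation. */
    interface TwoPhase {
        int nextCandidate();  // cheap: next candidate doc ID, or NO_MORE_DOCS
        boolean matches();    // costly: confirms the current candidate
    }

    /** Toy implementation: candidates come from an array; even doc IDs "match". */
    static final class ToyTwoPhase implements TwoPhase {
        private final int[] candidates;
        private int index = -1;
        int confirmations = 0;  // counts how often the costly step ran

        ToyTwoPhase(int... candidates) {
            this.candidates = candidates;
        }

        @Override
        public int nextCandidate() {
            index++;
            return index < candidates.length ? candidates[index] : NO_MORE_DOCS;
        }

        @Override
        public boolean matches() {
            confirmations++;
            return candidates[index] % 2 == 0;
        }
    }

    /** Applies deletions between the approximation and the confirmation. */
    static List<Integer> score(TwoPhase twoPhase, BitSet liveDocs) {
        List<Integer> collected = new ArrayList<>();
        for (int doc = twoPhase.nextCandidate();
             doc != NO_MORE_DOCS;
             doc = twoPhase.nextCandidate()) {
            // 1. The approximation has advanced to a candidate (cheap).
            // 2. Skip deleted documents before doing any costly work.
            if (liveDocs != null && !liveDocs.get(doc)) {
                continue;
            }
            // 3. Confirm the match (costly), then 4. collect.
            if (twoPhase.matches()) {
                collected.add(doc);
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        BitSet liveDocs = new BitSet(10);
        liveDocs.set(0, 10);   // docs 0..9 are live
        liveDocs.clear(4);     // doc 4 is deleted

        ToyTwoPhase twoPhase = new ToyTwoPhase(1, 2, 4, 6, 7);
        List<Integer> hits = score(twoPhase, liveDocs);
        System.out.println("hits=" + hits + ", confirmations=" + twoPhase.confirmations);
    }
}
```

In this toy run the deleted document 4 is skipped before `matches()` is called, so the costly step runs four times for five candidates. A scorer without an approximation would skip step 3 and only apply the live-docs check on top of plain iteration.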



