[jira] [Updated] (LUCENE-6553) Simplify how we handle deleted docs in read APIs

Adrien Grand (JIRA) Tue, 23 Jun 2015 10:46:37 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Adrien Grand updated LUCENE-6553:
---------------------------------
    Attachment: LUCENE-6553.patch

Here is a patch that removes the handling of acceptDocs from the postings, 
spans and scorer APIs, and moves it from the constructor to the score() method 
for BulkScorer.

In general I think it simplifies the code a lot:
 - we have lots of postings formats and query impls that do not need to care 
about deleted docs at all anymore since they use the default bulk scorer
 - CheckIndex does not need to test that postings formats ignore deleted docs 
correctly

One thing I am unsure about is whether LeafReader.postings should still apply 
deleted docs or not. At other call sites the change surfaces as a compilation 
error, since the acceptDocs parameter was removed, but this method never had 
such a parameter and implicitly applied the reader's live docs. For now I have 
documented explicitly that live docs are not applied, but I could also 
understand why someone would like to see live docs applied by this method. The 
reason I decided not to apply live docs is that if you used this method in a 
Query implementation, the Scorer would be illegal since it would apply live 
docs while it is not supposed to.
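
To illustrate what "live docs are not applied" means for a caller, here is a 
minimal sketch, not taken from the patch. It assumes the raw postings are 
available as a plain int[] of doc IDs and that live docs can be modelled as an 
IntPredicate; LiveDocsFilterSketch and applyLiveDocs are hypothetical names, 
not Lucene APIs.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

// Hypothetical helper for this sketch; not part of Lucene.
class LiveDocsFilterSketch {

    /**
     * Keeps only the doc IDs accepted by liveDocs.
     * A null liveDocs is taken to mean "no deletions" (an assumption of this sketch).
     */
    static int[] applyLiveDocs(int[] rawPostings, IntPredicate liveDocs) {
        if (liveDocs == null) {
            return rawPostings.clone();
        }
        // The postings themselves know nothing about deletions;
        // filtering happens once, on top, at the caller's discretion.
        return Arrays.stream(rawPostings).filter(liveDocs).toArray();
    }
}
```

A caller that wants only live documents filters explicitly, while a Scorer 
built on the same postings leaves filtering to the bulk scorer.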

> Simplify how we handle deleted docs in read APIs
> ------------------------------------------------
>
>                 Key: LUCENE-6553
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6553
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: Trunk
>
>         Attachments: LUCENE-6553.patch
>
>
> Today, all scorers and postings formats need to be able to handle deleted 
> documents.
> I suspect that the reason is that we want to be able to make sure to not 
> perform costly operations on documents that are deleted. For instance if you 
> run a phrase query, reading positions on a document which is deleted is 
> useless. I suspect this is also a source of inefficiencies since in some 
> cases we apply deleted documents several times: for instance conjunctions 
> apply deleted docs to every sub scorer.
> However, with the new two-phase iteration API, we have a way to make sure 
> that we never run expensive operations on deleted documents: we could first 
> iterate over the approximation, then check that the document is not deleted, 
> and finally confirm the match. Since approximations are cheap, applying 
> deleted docs after them would not be an issue.
> I would like to explore removing the "Bits acceptDocs" parameter from 
> TermsEnum.postings, Weight.scorer, SpanWeight.getSpans and Weight.bulkScorer, 
> and adding it to BulkScorer.score. This way, bulk scorers would be the only 
> API that needs to know how to apply deleted docs, which I think would be 
> more manageable since we only have 3 or 4 impls. And DefaultBulkScorer would 
> be implemented the way described above: first advance the approximation, then 
> check deleted docs, then confirm the match, then collect. Of course that 
> only applies when the scorer supports approximations. If it does not, it 
> means it is cheap, so we can directly iterate the scorer and check deleted 
> docs on top.
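
The loop described in the issue (advance the approximation, check deleted 
docs, confirm the match, collect) could look roughly like the sketch below. 
It is self-contained and deliberately does not use Lucene's classes: 
Approximation, TwoPhase, score and demo are hypothetical stand-ins, and live 
docs are modelled as an IntPredicate where null is assumed to mean "no 
deletions".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-ins for this sketch; these are NOT Lucene's classes.
class TwoPhaseSketch {

    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Cheap iterator over candidate doc IDs (the "approximation"). */
    interface Approximation {
        int nextDoc(); // returns NO_MORE_DOCS when exhausted
    }

    /** Approximation plus an expensive confirmation step, e.g. reading positions. */
    interface TwoPhase {
        Approximation approximation();
        boolean matches(int doc);
    }

    /**
     * Bulk scoring loop: advance the approximation, check live docs,
     * confirm the match, then collect. A null acceptDocs means no deletions.
     */
    static List<Integer> score(TwoPhase twoPhase, IntPredicate acceptDocs) {
        List<Integer> collected = new ArrayList<>();
        Approximation approx = twoPhase.approximation();
        for (int doc = approx.nextDoc(); doc != NO_MORE_DOCS; doc = approx.nextDoc()) {
            if (acceptDocs != null && !acceptDocs.test(doc)) {
                continue; // deleted: skip before the costly confirmation
            }
            if (twoPhase.matches(doc)) {
                collected.add(doc);
            }
        }
        return collected;
    }

    /** Demo: candidates 1, 3, 5, 8; doc 3 is deleted; doc 8 fails confirmation. */
    static List<Integer> demo(List<Integer> confirmed) {
        int[] candidates = {1, 3, 5, 8};
        TwoPhase twoPhase = new TwoPhase() {
            public Approximation approximation() {
                int[] pos = {0};
                return () -> pos[0] < candidates.length ? candidates[pos[0]++] : NO_MORE_DOCS;
            }
            public boolean matches(int doc) {
                confirmed.add(doc); // record which docs paid for confirmation
                return doc != 8;
            }
        };
        return score(twoPhase, doc -> doc != 3);
    }
}
```

In the demo, the deleted doc 3 never reaches matches(), so the expensive 
confirmation is only paid for live candidates, and deletions are applied once 
in the loop rather than in every sub scorer.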



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
