[
https://issues.apache.org/jira/browse/LUCENE-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-6553:
---------------------------------
Attachment: LUCENE-6553.patch
Here is a patch that removes the handling of acceptDocs from the postings,
spans and scorer APIs, and moves it from the constructor to the score() method
for BulkScorer.
In general I think it simplifies the code a lot:
- we have lots of postings formats and query impls that do not need to care
about deleted docs at all anymore since they use the default bulk scorer
- CheckIndex does not need to test that postings formats ignore deleted docs
correctly
One thing I am unsure about is whether LeafReader.postings should still apply
deleted docs or not. At other call sites, the change causes a compilation
error since the acceptDocs parameter was removed, but this method never had
such a parameter and implicitly applied the reader's live docs. For now I
documented explicitly that live docs are not applied, but I could also
understand why someone would like to see live docs applied by this method. The
reason I decided not to apply live docs is that if you used this method in a
Query implementation, the resulting Scorer would be illegal, since it would
apply live docs when it is not supposed to.
> Simplify how we handle deleted docs in read APIs
> ------------------------------------------------
>
> Key: LUCENE-6553
> URL: https://issues.apache.org/jira/browse/LUCENE-6553
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: Trunk
>
> Attachments: LUCENE-6553.patch
>
>
> Today, all scorers and postings formats need to be able to handle deleted
> documents.
> I suspect the reason is that we want to avoid performing costly operations on
> deleted documents. For instance, if you run a phrase query, reading positions
> on a deleted document is useless. I suspect this is also a source of
> inefficiencies, since in some cases we apply deleted documents several times:
> for instance, conjunctions apply deleted docs to every sub scorer.
> However, with the new two-phase iteration API, we have a way to make sure
> that we never run expensive operations on deleted documents: we could first
> iterate over the approximation, then check that the document is not deleted,
> and finally confirm the match. Since approximations are cheap, applying
> deleted docs after them would not be an issue.
> I would like to explore removing the "Bits acceptDocs" parameter from
> TermsEnum.postings, Weight.scorer, SpanWeight.getSpans and Weight.bulkScorer,
> and add it to BulkScorer.score. This way, bulk scorers would be the only API
> which would need to know how to apply deleted docs, which I think would be
> more manageable since we only have 3 or 4 impls. And DefaultBulkScorer would
> be implemented the way described above: first advance the approximation, then
> check deleted docs, then confirm the match, then collect. Of course, that
> only applies when the scorer supports approximations; if it does not, the
> scorer is cheap, so we can iterate it directly and check deleted docs on
> top.
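
To make the ordering described above concrete, here is a minimal,
self-contained sketch of the loop: advance the approximation, check live docs,
confirm the match, then collect. It is not the patch and does not use the
Lucene classes; LiveDocs, Approximation, Confirmation, Collector and the
score/demo methods are hypothetical stand-ins invented for this illustration,
and scoring itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the Lucene classes.
final class TwoPhaseBulkScoringSketch {

  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Live docs: get(doc) returns true if the document is not deleted. */
  interface LiveDocs { boolean get(int doc); }

  /** Cheap iterator over candidate doc IDs (the "approximation"). */
  interface Approximation { int nextDoc(); }

  /** Expensive confirmation step, e.g. reading positions for a phrase query. */
  interface Confirmation { boolean matches(int doc); }

  /** Receives every document that is both live and confirmed. */
  interface Collector { void collect(int doc); }

  /** Visits all candidates, applying live docs between the two phases. */
  static void score(Approximation approximation, Confirmation confirmation,
                    LiveDocs liveDocs, Collector collector) {
    for (int doc = approximation.nextDoc();
         doc != NO_MORE_DOCS;
         doc = approximation.nextDoc()) {
      if (liveDocs != null && !liveDocs.get(doc)) {
        continue; // deleted: skip before doing any expensive work
      }
      if (confirmation.matches(doc)) {
        collector.collect(doc);
      }
    }
  }

  /** Demo: candidates 0..5, doc 2 is deleted, only even docs confirm. */
  static List<Integer> demo(int[] confirmCalls) {
    int[] candidates = {0, 1, 2, 3, 4, 5};
    int[] cursor = {0};
    Approximation approximation =
        () -> cursor[0] < candidates.length ? candidates[cursor[0]++] : NO_MORE_DOCS;
    Confirmation confirmation = doc -> {
      confirmCalls[0]++; // count how often the expensive step runs
      return doc % 2 == 0;
    };
    LiveDocs liveDocs = doc -> doc != 2;
    List<Integer> collected = new ArrayList<>();
    score(approximation, confirmation, liveDocs, collected::add);
    return collected;
  }

  public static void main(String[] args) {
    int[] confirmCalls = {0};
    System.out.println(demo(confirmCalls) + " after " + confirmCalls[0] + " confirmations");
  }
}
```

In the demo, the deleted document 2 is dropped right after the approximation,
so the confirmation step runs on five of the six candidates and never on the
deleted one. A scorer without an approximation would fit the same loop with a
confirmation that always returns true.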