[ https://issues.apache.org/jira/browse/LUCENE-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599016#comment-14599016 ]
Adrien Grand commented on LUCENE-6553:
--------------------------------------
I ran luceneutil on wikimedium10M with deleted documents:
{noformat}
Task              QPS baseline   StdDev  QPS patch   StdDev Pct diff
Fuzzy1                   91.66   (7.4%)      89.66   (7.3%)   -2.2% ( -15% -  13%)
OrHighNotLow             62.10   (3.1%)      61.14   (4.8%)   -1.5% (  -9% -   6%)
LowSpanNear              27.30   (3.3%)      26.88   (4.6%)   -1.5% (  -9% -   6%)
OrHighNotMed             38.74   (2.8%)      38.31   (4.6%)   -1.1% (  -8% -   6%)
HighSloppyPhrase          3.23   (4.7%)       3.20   (4.6%)   -1.0% (  -9% -   8%)
MedSloppyPhrase          54.01   (2.6%)      53.55   (2.2%)   -0.9% (  -5% -   4%)
LowSloppyPhrase          41.78   (2.3%)      41.50   (2.5%)   -0.7% (  -5% -   4%)
LowPhrase                15.75   (1.2%)      15.68   (2.1%)   -0.4% (  -3% -   2%)
MedPhrase                14.62   (1.5%)      14.58   (2.4%)   -0.3% (  -4% -   3%)
HighPhrase               20.86   (3.1%)      20.86   (4.0%)   -0.0% (  -6% -   7%)
Respell                  94.55   (4.8%)      94.58   (4.5%)    0.0% (  -8% -   9%)
Wildcard                 60.39   (4.4%)      60.49   (3.9%)    0.2% (  -7% -   8%)
OrHighNotHigh            33.38   (1.7%)      33.52   (3.3%)    0.4% (  -4% -   5%)
HighSpanNear              8.55   (2.4%)       8.61   (3.0%)    0.8% (  -4% -   6%)
OrNotHighMed            211.67   (1.6%)     214.27   (2.3%)    1.2% (  -2% -   5%)
OrNotHighHigh            63.32   (1.6%)      64.12   (3.1%)    1.3% (  -3% -   6%)
OrNotHighLow           1031.92   (3.7%)    1045.97   (4.7%)    1.4% (  -6% -  10%)
HighTerm                141.43   (3.7%)     143.85   (4.9%)    1.7% (  -6% -  10%)
MedSpanNear              34.47   (2.0%)      35.07   (2.1%)    1.7% (  -2% -   5%)
MedTerm                 208.01   (3.6%)     211.76   (4.8%)    1.8% (  -6% -  10%)
AndHighLow              819.26   (5.5%)     842.89   (5.5%)    2.9% (  -7% -  14%)
AndHighMed              203.53   (2.2%)     210.03   (2.1%)    3.2% (  -1% -   7%)
IntNRQ                   14.08   (8.1%)      14.59  (10.1%)    3.6% ( -13% -  23%)
Prefix3                  41.82   (7.6%)      43.52   (8.8%)    4.1% ( -11% -  22%)
AndHighHigh              47.54   (1.9%)      49.68   (2.2%)    4.5% (   0% -   8%)
OrHighMed                71.76   (5.4%)      76.11   (4.9%)    6.1% (  -4% -  17%)
LowTerm                 654.52   (9.3%)     695.50  (10.3%)    6.3% ( -12% -  28%)
OrHighLow                67.44   (5.4%)      72.46   (5.0%)    7.4% (  -2% -  18%)
OrHighHigh               26.92   (5.8%)      28.95   (5.4%)    7.5% (  -3% -  19%)
Fuzzy2                   81.71  (22.1%)      96.27  (18.1%)   17.8% ( -18% -  74%)
{noformat}
For most queries performance is similar, but disjunctions (OrHighMed,
OrHighLow, OrHighHigh) look like they got a slight performance boost of
about 6-7% with this patch.
> Simplify how we handle deleted docs in read APIs
> ------------------------------------------------
>
> Key: LUCENE-6553
> URL: https://issues.apache.org/jira/browse/LUCENE-6553
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: Trunk
>
> Attachments: LUCENE-6553.patch
>
>
> Today, all scorers and postings formats need to be able to handle deleted
> documents.
> I suspect that the reason is that we want to be able to make sure to not
> perform costly operations on documents that are deleted. For instance if you
> run a phrase query, reading positions on a document which is deleted is
> useless. I suspect this is also a source of inefficiencies since in some
> cases we apply deleted documents several times: for instance conjunctions
> apply deleted docs to every sub scorer.
> However, with the new two-phase iteration API, we have a way to make sure
> that we never run expensive operations on deleted documents: we could first
> iterate over the approximation, then check that the document is not deleted,
> and finally confirm the match. Since approximations are cheap, applying
> deleted docs after them would not be an issue.
> I would like to explore removing the "Bits acceptDocs" parameter from
> TermsEnum.postings, Weight.scorer, SpanWeight.getSpans and Weight.bulkScorer,
> and adding it to BulkScorer.score. This way, bulk scorers would be the only API
> which would need to know how to apply deleted docs, which I think would be
> more manageable since we only have 3 or 4 impls. And DefaultBulkScorer would
> be implemented the way described above: first advance the approximation, then
> check deleted docs, then confirm the match, then collect. Of course, that
> only applies when the scorer supports approximations. If it does not, it
> means it is cheap, so we can directly iterate the scorer and check deleted
> docs on top.
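
The iteration order described in the issue (advance the approximation, check
deleted docs, confirm the match, collect) can be sketched roughly as follows.
This is a minimal, self-contained illustration and not Lucene's actual API:
TwoPhase, ArrayTwoPhase, score and the confirmations counter are hypothetical
names made up for this example, and an IntPredicate stands in for the
"Bits acceptDocs" parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class TwoPhaseSketch {

    /** Hypothetical stand-in for a scorer that supports approximations. */
    interface TwoPhase {
        int NO_MORE_DOCS = Integer.MAX_VALUE;

        /** Cheap: moves to the next candidate doc id, or NO_MORE_DOCS. */
        int nextApproximation();

        /** Expensive: confirms whether the current candidate really matches. */
        boolean matches();
    }

    /** Toy implementation over a sorted array of candidate doc ids. */
    static final class ArrayTwoPhase implements TwoPhase {
        private final int[] candidates;
        private final IntPredicate confirm;
        private int index = -1;
        int confirmations = 0; // counts how often the costly check ran

        ArrayTwoPhase(int[] candidates, IntPredicate confirm) {
            this.candidates = candidates;
            this.confirm = confirm;
        }

        @Override
        public int nextApproximation() {
            index++;
            return index < candidates.length ? candidates[index] : NO_MORE_DOCS;
        }

        @Override
        public boolean matches() {
            confirmations++;
            return confirm.test(candidates[index]);
        }
    }

    /**
     * Assumed shape of a bulk-scoring loop: deleted docs are filtered after
     * the approximation and before the confirmation. A null liveDocs means
     * the segment has no deletions.
     */
    static List<Integer> score(TwoPhase it, IntPredicate liveDocs) {
        List<Integer> collected = new ArrayList<>();
        for (int doc = it.nextApproximation();
             doc != TwoPhase.NO_MORE_DOCS;
             doc = it.nextApproximation()) {
            if (liveDocs != null && !liveDocs.test(doc)) {
                continue; // deleted: skip before paying for matches()
            }
            if (it.matches()) {
                collected.add(doc); // stands in for collector.collect(doc)
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        // Candidates 1, 3, 5, 7; doc 5 fails confirmation; doc 3 is deleted.
        ArrayTwoPhase it = new ArrayTwoPhase(new int[] {1, 3, 5, 7}, d -> d != 5);
        List<Integer> hits = score(it, d -> d != 3);
        System.out.println(hits);             // prints [1, 7]
        System.out.println(it.confirmations); // prints 3
    }
}
```

In this toy run the costly confirmation is invoked three times for four
candidates, because the deleted document is dropped right after the cheap
approximation step.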