[
https://issues.apache.org/jira/browse/LUCENE-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257754#comment-15257754
]
Adrien Grand commented on LUCENE-7254:
--------------------------------------
bq. by the way: DocIDSetBuilder could use some of this same logic for postings
to remove its cardinality computation too: just substitute sumDocFreq for
numPoints. But it's the lesser of the problems here.
There used to be an optimization for avoiding the cardinality computation but I
removed it in LUCENE-7051 (it could not use stats to estimate the number of
docs per point at that time though).
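
To make the stats idea concrete, here is a minimal standalone sketch of estimating
the number of matched docs from index statistics instead of counting set bits.
It is an illustration, not the code in the patch: the class and method names are
hypothetical, and the inputs stand for whatever the index exposes (numPoints and
the field's doc count for points, or sumDocFreq and the doc count for postings).

```java
// Hypothetical standalone sketch; not Lucene's actual implementation.
final class CardinalityEstimate {
    private CardinalityEstimate() {}

    /**
     * Estimates how many distinct docs were collected, given how many values
     * (points or postings) were added.
     *
     * @param valuesAdded   number of values added to the builder so far
     * @param totalValues   total values in the field (e.g. numPoints or sumDocFreq)
     * @param docsWithField number of docs that have at least one value in the field
     */
    static long estimateDocs(long valuesAdded, long totalValues, long docsWithField) {
        if (valuesAdded < 0 || totalValues < 0 || docsWithField < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        if (totalValues == 0 || docsWithField == 0) {
            return 0; // empty field: nothing can match
        }
        // Average number of values per doc; never below 1 for a doc that has the field.
        double valuesPerDoc = Math.max(1.0, (double) totalValues / docsWithField);
        long estimate = Math.round(valuesAdded / valuesPerDoc);
        // The result can never exceed the number of docs that have the field.
        return Math.min(estimate, docsWithField);
    }
}
```

The estimate assumes values are spread evenly across docs, so it is only a
heuristic; it trades exactness for skipping a full pass over the bit set.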
bq. For later improvements, to try to do more fancy things so a
DocIDSetBuilder-type approach works better, we can consider, e.g., improving
BKDReader.addAll to target its entire range more efficiently and call grow()
less often but with bigger numbers
+1. I don't like that this patch might create iterators over sparse FixedBitSet
instances. I am fine with doing that temporarily for queries that are likely to
match many docs (I see that you modified the range queries but not the
point-in-set queries, for instance), but in the longer term I think we should
improve points so that we can know earlier how many docs are going to be added.
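
As a rough illustration of "call grow() less often but with bigger numbers",
the sketch below reserves capacity once for a whole batch so that the per-doc
add() carries no capacity or structure-switching branch. The class
ReservedDocBuffer and its methods are hypothetical names for this example; it
is not DocIdSetBuilder and does not reflect the patch.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is not Lucene's DocIdSetBuilder.
final class ReservedDocBuffer {
    private int[] docs = new int[0];
    private int size;

    /** Reserves room for up to numDocs further add() calls; all capacity checks live here. */
    void grow(int numDocs) {
        if (numDocs < 0) {
            throw new IllegalArgumentException("numDocs must be >= 0");
        }
        int needed = Math.addExact(size, numDocs);
        if (needed > docs.length) {
            // Grow geometrically so repeated small reservations stay amortized cheap.
            int doubled = (int) Math.min(Integer.MAX_VALUE - 8L, 2L * docs.length);
            docs = Arrays.copyOf(docs, Math.max(needed, doubled));
        }
    }

    /**
     * Appends a doc ID. There is no capacity branch in this method; the caller
     * must have reserved space via grow(). Java's own array bounds check still applies.
     */
    void add(int doc) {
        docs[size++] = doc;
    }

    int size() {
        return size;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return docs[index];
    }
}
```

If the number of docs under a cell were known up front, a caller could issue a
single grow(count) for the whole cell and then add each doc without further
checks, which is the shape of improvement discussed above.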
> DocIDSetBuilder is no good for points
> -------------------------------------
>
> Key: LUCENE-7254
> URL: https://issues.apache.org/jira/browse/LUCENE-7254
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Robert Muir
> Attachments: LUCENE-7254.patch, LUCENE-7254.patch
>
>
> For the postings lists, I think this approach works well in dense cases (e.g.
> whole DISI's are added, things are coming in order, etc).
> However in the points case, it holds back range performance significantly.
> There are a couple of problems here:
> * expensive cardinality computation (this is a 2% hit) when it's totally
> unnecessary. We can use index statistics to help here.
> * lots of conditional stuff in add(). This includes growing checks, bitset
> switching checks and so on (which happen even if you are smart and call
> grow, and this stuff all adds up).
> I don't think we should try to create a magical shared API that is efficient
> both for postings lists of unstructured stuff and for point collection on
> structured fields. Instead, we should just do things differently for points
> and iterate from there.