[jira] [Commented] (LUCENE-7254) DocIDSetBuilder is no good for points

Robert Muir (JIRA) Tue, 26 Apr 2016 07:29:37 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258169#comment-15258169
 ]


Robert Muir commented on LUCENE-7254:
-------------------------------------

High cardinality is unrelated. These spatial fields are also high cardinality.

The abusive case is optimizations for things like index sparsity that slow down 
the normal case.

As far as "using points as an ID field" goes, I don't think it's ready for that, 
for a number of reasons, so we shouldn't let that slow down range queries either. 
The data structure is not even optimized for that, and who knows if it will even 
work reasonably. If you have an ID field, you can index it with StringField, and 
the postings have already been tweaked to hell for that. So use the best 
data structure for the job.

Another common abuse case is "shit tons of IDs" to do super-inefficient 
graph/join type operations: but we have newSetQuery to contain that case too, 
so we don't need to keep range queries slow.

So really, the range query should be fast for ranges. That is its job, and 
bulk-collect is always the bottleneck for that operation.
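
To illustrate why bulk-collect dominates a range query, here is a minimal 
sketch in plain Java. It is not Lucene code: LeafBlock and RangeCollector are 
hypothetical names, and the index is reduced to a one-dimensional list of 
non-empty blocks whose values are sorted. The assumption is that a block lying 
entirely inside the range can hand over all of its doc IDs without per-value 
checks, while only blocks that cross a range boundary need filtering.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical 1-D stand-in for one leaf block of a points index. */
final class LeafBlock {
    final int[] values; // sorted ascending, assumed non-empty
    final int[] docIds; // docIds[i] is the document holding values[i]

    LeafBlock(int[] values, int[] docIds) {
        this.values = values;
        this.docIds = docIds;
    }

    int min() { return values[0]; }
    int max() { return values[values.length - 1]; }
}

final class RangeCollector {
    /** Collects the doc IDs whose value lies in [lo, hi], both inclusive. */
    static int[] collect(List<LeafBlock> blocks, int lo, int hi) {
        // Size the output once from an upper bound instead of growing per add.
        int total = 0;
        for (LeafBlock b : blocks) {
            total += b.docIds.length;
        }
        int[] out = new int[total];
        int n = 0;

        for (LeafBlock b : blocks) {
            if (b.max() < lo || b.min() > hi) {
                continue; // block entirely outside the range: skip it
            }
            if (b.min() >= lo && b.max() <= hi) {
                // Block entirely inside the range: bulk-collect, no value checks.
                System.arraycopy(b.docIds, 0, out, n, b.docIds.length);
                n += b.docIds.length;
            } else {
                // Block crosses a range boundary: check each value.
                for (int i = 0; i < b.values.length; i++) {
                    if (b.values[i] >= lo && b.values[i] <= hi) {
                        out[n++] = b.docIds[i];
                    }
                }
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

For a wide range, most matching documents arrive through the bulk path, so 
whatever runs per collected doc ID there is what ends up costing the most.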


> DocIDSetBuilder is no good for points
> -------------------------------------
>
>                 Key: LUCENE-7254
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7254
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-7254.patch, LUCENE-7254.patch
>
>
> For the postings lists, I think this approach works well in dense cases (e.g. 
> whole DISIs are added, things are coming in order, etc).
> However in the points case, it holds back range performance significantly. 
> There are a couple of problems here:
> * expensive cardinality computation (this is a 2% hit) when it's totally 
> unnecessary. We can use index statistics to help here.
> * lots of conditional stuff in add(). This includes growing checks / bitset 
> switching checks and so on (which happens even if you are smart and call 
> grow, but this stuff all adds up). 
> I don't think we should try to create a magical shared API that is both 
> efficient for postings lists of unstructured stuff and at the same time point 
> collection for structured fields. Instead, we should just do things 
> differently for points and iterate from there.
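
As a rough illustration of "doing things differently for points", the sketch 
below shows a buffer whose add() is a bare array store, with the capacity 
check paid once per batch in grow(). PointDocIdBuffer is a hypothetical name 
and this is not the attached patch or Lucene's DocIdSetBuilder; it assumes the 
caller knows an upper bound for each batch (for example, the size of a leaf 
block) and that duplicates are possible when a document carries more than one 
value.

```java
import java.util.Arrays;

/** Hypothetical points-only doc ID buffer: grow once per batch, then add unchecked. */
final class PointDocIdBuffer {
    private int[] docs = new int[0];
    private int size = 0;

    /** Reserves room for up to n more doc IDs; call once before a batch of adds. */
    void grow(int n) {
        if (size + n > docs.length) {
            docs = Arrays.copyOf(docs, Math.max(size + n, docs.length * 2));
        }
    }

    /**
     * Bare array store: no capacity check and no dense/sparse switching.
     * The caller must have reserved space with grow() first.
     */
    void add(int doc) {
        docs[size++] = doc;
    }

    /** Sorts the collected IDs and drops duplicates. */
    int[] build() {
        Arrays.sort(docs, 0, size);
        int unique = 0;
        for (int i = 0; i < size; i++) {
            if (i == 0 || docs[i] != docs[i - 1]) {
                docs[unique++] = docs[i];
            }
        }
        return Arrays.copyOf(docs, unique);
    }
}
```

Usage would be one grow(count) per batch followed by count calls to add(), 
then a single build() at the end. The trade-off is that add() is only safe 
after a matching grow(), which is the kind of contract a general-purpose 
builder shared with postings cannot easily assume.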



