[jira] [Commented] (LUCENE-7462) Faster search APIs for doc values

Yonik Seeley (JIRA) Tue, 18 Oct 2016 17:41:19 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587192#comment-15587192
 ]


Yonik Seeley commented on LUCENE-7462:
--------------------------------------

bq. Wouldn't this mean we'd need 2X the search-time code [...]

If there were a utility to always get you a random access API?  Perhaps not.
It does seem like a majority of consumers would want the random access API 
only... things like grouping, sorting, and faceting are all driven off of 
document ids.   For each ID, we check the docvalues.  We don't actually do 
skipping/leapfrogging like a filter would do since we still need to do work for 
each document, even if the DV doesn't exist for that document.

I haven't thought about what this means for code further down the stack, but it 
does seem worth exploring in general.

> Faster search APIs for doc values
> ---------------------------------
>
>                 Key: LUCENE-7462
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7462
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: master (7.0)
>            Reporter: Adrien Grand
>            Priority: Minor
>
> While the iterator API helps deal with sparse doc values more efficiently, it 
> also makes search-time operations more costly. For instance, the old 
> random-access API allowed to compute facets on a given segment without any 
> conditionals, by just incrementing the counter at index {{ordinal+1}} while 
> the new API requires to advance the iterator if necessary and then check 
> whether it is exactly on the right document or not.
> Since it is very common for fields to exist across most documents, I suspect 
> codecs will keep an internal structure that is similar to the current codec 
> in the dense case, by having a dense representation of the data and just 
> making the iterator skip over the minority of documents that do not have a 
> value.
> I suggest that we add APIs that make things cheaper at search time. For 
> instance in the case of SORTED doc values, it could look like 
> {{LegacySortedDocValues}} with the additional restriction that documents can 
> only be consumed in order. Codecs that can implement this API efficiently 
> would hide it behind a {{SortedDocValues}} adapter, and then at search time 
> facets and comparators (which liked the {{LegacySortedDocValues}} API better) 
> would either unwrap or hide the SortedDocValues they got behind a more 
> random-access API (which would only happen in the truly sparse case if the 
> codec optimizes the dense case).
> One challenge is that we already use the same idea for hiding single-valued 
> impls behind multi-valued impls, so we would need to enforce the order in 
> which the wrapping needs to happen. At first sight, it seems that it would be 
> best to do the single-value-behind-multi-value-API wrapping above the 
> random-access-behind-iterator-API wrapping. The complexity of 
> wrapping/unwrapping in the right order could be contained in the 
> {{DocValues}} helper class.
> I think this change would also simplify search-time consumption of doc 
> values, which currently needs to spend several lines of code positioning the 
> iterator everytime it needs to do something interesting with doc values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-7462) Faster search APIs for doc values

Reply via email to