[jira] [Commented] (LUCENE-6539) Add DocValuesNumbersQuery, like DocValuesTermsQuery but works only with long values

Michael McCandless (JIRA) Tue, 09 Jun 2015 15:02:18 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579637#comment-14579637
 ]


Michael McCandless commented on LUCENE-6539:
--------------------------------------------

bq. new HashSet<Long>(Arrays.asList(array)).

Good, I'll fix.

bq.  However instead of keeping adding such queries to core, I think we should 
consider moving all our doc values queries to misc since they have complicated 
trade-offs and are only useful in expert use-cases?

+1, I can move them here.

{quote}
bq. in certain cases (many terms/numbers and fewish matching hits) it should be 
faster than using TermsQuery

This comment got me confused: I think in general these queries are more 
efficient when they match many documents, ie. even when an equivalent 
TermsQuery would not be used as a lead iterator in a conjunction? I think the 
only case when such a query matching few documents would be useful would be in 
a prohibited clause since these prohibited clauses can never be used to lead 
iteration anyway and are only used in a random-access fashion?
{quote}

Hmm, this is hard to think about, but yes, I was thinking about the use case 
where there is some other MUST'd clause as the primary, and this query is a 
MUST_NOT of a big list of numeric IDs.

The per-hit cost is higher with these DocValuesXXX queries (the forward lookup 
+ check) vs. visiting postings and ORing bitsets, which is what TermsQuery does 
(when there are enough terms), but the setup cost is higher with TermsQuery 
since it must look up many terms across N segments, which is why I thought "not 
matching too many total hits" would favor DocValuesXXXQuery with a large number 
of terms.
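
To make the "forward lookup + check" concrete, here is a minimal, 
self-contained sketch of the post-filter idea. It is not Lucene code: the 
class PostFilterSketch, the method filterNotInList, and the plain long[] 
standing in for a numeric doc values column are all hypothetical names made up 
for illustration. It assumes some other MUST'd clause has already produced the 
candidate doc IDs, and only shows that the per-hit work is one array read plus 
one hash lookup, with no per-term setup across segments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PostFilterSketch {

    /**
     * Toy stand-in for "MUST lead clause + MUST_NOT big list of numeric IDs".
     *
     * @param leadDocs    doc IDs produced by the primary (lead) clause
     * @param valuesByDoc hypothetical per-document numeric column, indexed by doc ID
     * @param excluded    the set of longs that disqualify a document
     * @return the lead docs whose value is not in the excluded set
     */
    public static List<Integer> filterNotInList(int[] leadDocs, long[] valuesByDoc, Set<Long> excluded) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : leadDocs) {
            long value = valuesByDoc[doc];    // forward lookup: doc ID -> value
            if (!excluded.contains(value)) {  // check: one hash probe per candidate hit
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Doc i carries the value 100 + i in this toy column.
        long[] valuesByDoc = {100L, 101L, 102L, 103L, 104L, 105L};
        // Pretend the MUST'd clause matched docs 1, 3 and 4.
        int[] leadDocs = {1, 3, 4};
        // The "NOT in list" values; 999 matches no document and costs nothing extra here.
        Set<Long> excluded = new HashSet<>(Arrays.asList(101L, 104L, 999L));
        System.out.println(filterNotInList(leadDocs, valuesByDoc, excluded)); // prints [3]
    }
}
```

In this toy model the cost grows with the number of docs the lead clause 
visits, not with the size of the excluded set, which is the intuition behind 
"many terms/numbers and fewish matching hits".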

E.g. in the extreme case where you pass a single term to your TermsQuery or 
DocValuesTermsQuery, matching many docs, and it's the primary (only) clause in 
the query, TermsQuery should be much faster.

bq. Its ok in current form to go to sandbox, but i think this needs to be 
integrated into the inverted approach so that based on index stats, lucene can 
just do the right thing.

OK, or I can just WONTFIX this ... I just thought there are use cases where 
this post-filter approach would be much faster than the choices we have today, 
e.g. when an app has numeric IDs and wants to make big "NOT in list" clauses.

I agree it would be better if we had only TermsQuery, and then it would figure 
out which strategy is best (use doc values, use numeric doc values if ids are 
really numeric, use postings) to take depending on index stats, whether clause 
is primary or not, etc... but this seems very tricky: I can't even properly 
think about the cases, see Adrien's comment above ;)

> Add DocValuesNumbersQuery, like DocValuesTermsQuery but works only with long 
> values
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-6539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6539
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: Trunk, 5.3
>
>         Attachments: LUCENE-6539.patch
>
>
> This query accepts any document where any of the provided set of longs
> was indexed into the specified field as a numeric DV field
> (NumericDocValuesField or SortedNumericDocValuesField).  You can use
> it instead of DocValuesTermsQuery when you have field values that can
> be represented as longs.
> Like DocValuesTermsQuery, this is slowish in general, since it doesn't
> use an inverted data structure, but in certain cases (many
> terms/numbers and fewish matching hits) it should be faster than using
> TermsQuery because it's done as a "post filter" when other (faster)
> query clauses are MUST'd with it.
> In such cases it should also be faster than DocValuesTermsQuery since
> it skips having to resolve terms -> ords.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

