You could index the prefix terms (edge ngrams), assuming your queries
are prefix queries; this way there would typically be far fewer terms
to visit than all 200 M terms.
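
To make the idea concrete, here is a minimal, self-contained sketch in
plain Java. It does not use Lucene: the class EdgeNgramSketch, its toy
in-memory postings map, and the methods edgeNgrams, index and
prefixQuery are all hypothetical names made up for illustration. In a
real index you would more likely produce the prefixes at analysis time
(for example with Lucene's EdgeNGramTokenFilter). The point is only
that a prefix query turns into a single term lookup instead of a walk
over every term that shares the prefix.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a toy inverted index, not Lucene's API.
public class EdgeNgramSketch {

    // Toy postings: indexed term (here, an edge ngram) -> matching doc ids.
    private final Map<String, BitSet> postings = new TreeMap<>();

    // Returns every prefix of `term` whose length is in [minGram, maxGram].
    static List<String> edgeNgrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int limit = Math.min(maxGram, term.length());
        for (int len = minGram; len <= limit; len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    // Indexes the edge ngrams of `term` for the given document.
    void index(int docId, String term, int minGram, int maxGram) {
        for (String gram : edgeNgrams(term, minGram, maxGram)) {
            postings.computeIfAbsent(gram, k -> new BitSet()).set(docId);
        }
    }

    // A prefix query is now one exact lookup: no enumeration of the
    // (possibly huge) set of full terms that start with `prefix`.
    BitSet prefixQuery(String prefix) {
        BitSet hits = postings.get(prefix);
        return hits == null ? new BitSet() : (BitSet) hits.clone();
    }

    public static void main(String[] args) {
        EdgeNgramSketch idx = new EdgeNgramSketch();
        idx.index(0, "lucene", 1, 4);
        idx.index(1, "lucid", 1, 4);
        idx.index(2, "solr", 1, 4);
        System.out.println(idx.prefixQuery("lu")); // docs 0 and 1
        System.out.println(idx.prefixQuery("so")); // doc 2
    }
}
```

The trade-off in this sketch is that a prefix longer than maxGram
matches nothing, so the gram range has to cover the prefixes you
expect users to type, and the index grows with every extra gram
length.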

Auto-prefix terms also tried to solve this more "automatically", so
you don't have to mess with edge ngrams, but we reverted it because of
the added code complexity and lack of real-world use cases, especially
once we switched numerics from postings to dimensional points.

Mike McCandless

http://blog.mikemccandless.com

On Thu, Sep 22, 2016 at 1:01 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> In MultiTermConstantScoreWrapper there's this block around line 174 in 6x:
>
> do {
>   docs = termsEnum.postings(docs, PostingsEnum.NONE);
>   builder.add(docs);
> } while (termsEnum.next() != null);
>
> In the case of lots and lots of terms in a multiValued field this can
> take quite a bit of time. In my test case I have 100K docs with 200M
> terms (pathological I understand, but it illustrates the issue). If
> I'm reading this right it loops through all the terms and, for each
> term, creates a sub-list of docs for the term and adds the sub-list to
> the "master list". So a query like 'field:*' takes 20+ seconds.
>
> Is there anything we can/should do to short circuit this kind of
> thing? In this case I got 200M terms by ngramming 3-32 (again, far too
> many ngrams I understand). It's not clear to me whether it's an easy
> check to say "stop when all the docs have been added to the master
> list"....
>
> I can raise a JIRA if it makes sense.
>
> For supporting this particular use-case, we could index a separate
> field "has_field1_value" but the general case still holds.
>
> Erick
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>

