[ 
https://issues.apache.org/jira/browse/LUCENE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753311#comment-16753311
 ] 

Amir Hadadi commented on LUCENE-7958:
-------------------------------------

[~jpountz] we were recently bitten by this: one term had a much higher 
document frequency than the others, and since the number of terms was more than 
15, we always paid the penalty of consuming that term's postings into a bitset. 
We manually split the terms query and performance improved drastically.

> Give TermInSetQuery better advancing capabilities
> -------------------------------------------------
>
>                 Key: LUCENE-7958
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7958
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7958.patch
>
>
> If a TermInSetQuery has more than 15 matching terms on a given segment, then 
> we consume all postings lists into a bitset and return an iterator over this 
> bitset as a scorer. I would like to change it so that we keep the 15 postings 
> lists that have the largest document frequencies and consume all other 
> (shorter) postings lists into a bitset. In the end we return a disjunction 
> over the N longest postings lists and the bitset. This could help consume 
> fewer doc ids if the TermInSetQuery is intersected with other queries, 
> especially if the document frequencies of the terms it wraps have a Zipfian 
> distribution.
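
The strategy described above can be sketched in plain Java. This is only an
illustration under stated assumptions, not the attached patch and not Lucene
code: postings lists are modelled as sorted int[] arrays of doc ids, binary
search stands in for real postings skipping, and the names
HybridTermSetSketch, Split, split and advance are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the proposal; none of these names exist in Lucene.
final class HybridTermSetSketch {

    /** Long postings lists kept as-is, plus the union of all shorter ones. */
    static final class Split {
        final List<int[]> longest; // kept as individual sorted doc id arrays
        final BitSet rest;         // every doc id from the remaining lists

        Split(List<int[]> longest, BitSet rest) {
            this.longest = longest;
            this.rest = rest;
        }
    }

    /** Keeps the `keep` longest postings lists and merges the others into a bitset. */
    static Split split(List<int[]> postings, int keep) {
        List<int[]> sorted = new ArrayList<>(postings);
        // Highest document frequency (longest list) first.
        sorted.sort(Comparator.comparingInt((int[] p) -> p.length).reversed());
        int cut = Math.min(keep, sorted.size());
        BitSet rest = new BitSet();
        for (int[] p : sorted.subList(cut, sorted.size())) {
            for (int doc : p) {
                rest.set(doc);
            }
        }
        return new Split(new ArrayList<>(sorted.subList(0, cut)), rest);
    }

    /** Returns the smallest matching doc id >= target (target >= 0), or -1 if none. */
    static int advance(Split s, int target) {
        int best = s.rest.nextSetBit(target); // -1 when the bitset has no match
        for (int[] p : s.longest) {
            // Jump into the long list instead of reading all of its entries.
            int idx = Arrays.binarySearch(p, target);
            if (idx < 0) {
                idx = -idx - 1; // insertion point: first element > target
            }
            if (idx < p.length && (best == -1 || p[idx] < best)) {
                best = p[idx];
            }
        }
        return best;
    }
}
```

In this sketch the high-frequency list is never copied into the bitset, so
when another clause of a conjunction drives advance() with sparse targets,
most of its doc ids are never visited. Only the short lists pay the cost of
being consumed up front.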



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
