[
https://issues.apache.org/jira/browse/LUCENE-7958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753311#comment-16753311
]
Amir Hadadi commented on LUCENE-7958:
-------------------------------------
[~jpountz] we have recently been bitten by this: one term had a much higher
document frequency than the others, and since the number of terms was more than
15, we were always paying the penalty for consuming that term into a bitset. We
manually split the terms query and the performance improved drastically.
> Give TermInSetQuery better advancing capabilities
> -------------------------------------------------
>
> Key: LUCENE-7958
> URL: https://issues.apache.org/jira/browse/LUCENE-7958
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-7958.patch
>
>
> If a TermInSetQuery has more than 15 matching terms on a given segment, then
> we consume all postings lists into a bitset and return an iterator over this
> bitset as a scorer. I would like to change it so that we keep the 15 postings
> lists that have the largest document frequencies and consume all other
> (shorter) postings lists into a bitset. In the end we return a disjunction
> over the N longest postings lists and the bitset. This could help consume
> fewer doc ids if the TermInSetQuery is intersected with other queries,
> especially if the document frequencies of the terms it wraps have a zipfian
> distribution.
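
To make the proposed split concrete, here is a minimal sketch in plain Java. It
is not Lucene code and not the attached patch: the class PostingsPartition, its
Split type and the split method are hypothetical names, and terms are
represented only by a map from term to document frequency. It shows only the
partitioning step, assuming the N terms with the largest document frequencies
keep their own postings iterators and the rest are merged into the bitset.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Lucene.
final class PostingsPartition {

    /** Terms kept as individual postings lists vs. terms folded into the bitset. */
    static final class Split {
        final List<String> keptAsPostings;
        final List<String> mergedIntoBitset;

        Split(List<String> keptAsPostings, List<String> mergedIntoBitset) {
            this.keptAsPostings = keptAsPostings;
            this.mergedIntoBitset = mergedIntoBitset;
        }
    }

    /**
     * Keeps the maxKept terms with the largest document frequencies and
     * assigns every other (shorter) postings list to the bitset.
     */
    static Split split(Map<String, Integer> docFreqByTerm, int maxKept) {
        List<String> terms = new ArrayList<>(docFreqByTerm.keySet());
        // Sort by descending document frequency; break ties by term text
        // so the result is deterministic.
        terms.sort((a, b) -> {
            int byFreq = Integer.compare(docFreqByTerm.get(b), docFreqByTerm.get(a));
            return byFreq != 0 ? byFreq : a.compareTo(b);
        });
        int cut = Math.min(maxKept, terms.size());
        return new Split(
            new ArrayList<>(terms.subList(0, cut)),
            new ArrayList<>(terms.subList(cut, terms.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqByTerm = new HashMap<>();
        docFreqByTerm.put("common", 1_000_000);
        docFreqByTerm.put("mid", 5_000);
        docFreqByTerm.put("rare", 12);
        // 15 is the threshold mentioned in the issue; 2 is used here to keep the demo small.
        Split split = split(docFreqByTerm, 2);
        System.out.println("kept as postings: " + split.keptAsPostings);
        System.out.println("merged into bitset: " + split.mergedIntoBitset);
    }
}
```

In this sketch the one very frequent term from the comment above would land in
keptAsPostings, so it could be advanced lazily rather than being consumed in
full while building the bitset.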