[ https://issues.apache.org/jira/browse/LUCENE-4628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542954#comment-13542954 ]
Matthew Willson commented on LUCENE-4628:
-----------------------------------------

This is nice.

Would it make sense, perhaps, to base the cut-off on the *cumulative* document frequency? That is: sort the terms by DF, then add terms to the MUST subquery one at a time until a limit on the total DF of all terms added is exceeded. The remaining terms are then added to a SHOULD subquery.

This seems like it would set an upper bound on the total number of documents scored, or on the total number of postings list entries that need to be inspected to select documents for scoring. (Good chance I'm missing something here, mind...)

With a cut-off based on per-term doc frequency, by contrast, arbitrarily many terms could be introduced into the MUST subquery, provided they all slip under the per-term DF threshold, and hence arbitrarily many documents could be scored.

> Add common terms query to gracefully handle very high frequent terms dynamically
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-4628
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4628
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/other
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: 4.1, 5.0
>
>         Attachments: LUCENE-4628.patch, LUCENE-4628.patch
>
>
> Several times over the last couple of months I have run into the problem
> that searches very often contained extremely high-frequency terms, and
> disjunction queries became far too slow. The main problem was that stopword
> filtering wasn't really an option, since in that domain those high-frequency
> terms were not really stopwords. For instance, searching for a song title
> "this is it" or for a band "A" didn't really fly with stopwords.
> I thought about that for a while and came up with a query-based solution
> that decides, based on a threshold, whether a term is considered a stopword.
> It splits the terms into two boolean queries, one for high-frequency terms
> and one for low-frequency terms, such that the high-frequency terms are only
> matched if the low-frequency sub-query produces a match. If all terms are
> high-frequency, it turns the entire query into a conjunction, which gave me
> reasonable results as well as reasonable performance.
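
To make the difference between the two cut-offs in the comment concrete, here is a minimal sketch in plain Python. It is not Lucene code and not the attached patch: the function names, the `doc_freqs` mapping, and the example numbers are hypothetical, and it assumes a term is admitted to the MUST group only while the running DF total stays within the limit (the comment leaves open whether the term that crosses the limit is included).

```python
from typing import Dict, List, Tuple


def split_by_cumulative_df(
    doc_freqs: Dict[str, int], max_total_df: int
) -> Tuple[List[str], List[str]]:
    """Split terms into (must_terms, should_terms) with a cumulative DF budget.

    Terms are visited in ascending DF order. A term joins the MUST group only
    while the running DF total stays within max_total_df (an assumption, see
    the lead-in); every other term goes to the SHOULD group.
    """
    must_terms: List[str] = []
    should_terms: List[str] = []
    total_df = 0
    # Sort by DF, breaking ties by term so the result is deterministic.
    for term, df in sorted(doc_freqs.items(), key=lambda kv: (kv[1], kv[0])):
        if total_df + df <= max_total_df:
            must_terms.append(term)
            total_df += df
        else:
            # DFs are ascending, so every later term would also overflow.
            should_terms.append(term)
    return must_terms, should_terms


def split_by_per_term_df(
    doc_freqs: Dict[str, int], max_term_df: int
) -> Tuple[List[str], List[str]]:
    """Split terms with a per-term DF threshold, for comparison."""
    ordered = sorted(doc_freqs.items(), key=lambda kv: (kv[1], kv[0]))
    must_terms = [term for term, df in ordered if df <= max_term_df]
    should_terms = [term for term, df in ordered if df > max_term_df]
    return must_terms, should_terms


if __name__ == "__main__":
    # Hypothetical document frequencies, for illustration only.
    example = {"this": 900, "is": 950, "it": 800, "michael": 60, "jackson": 40}
    print(split_by_cumulative_df(example, max_total_df=150))
    print(split_by_per_term_df(example, max_term_df=1000))
```

With these made-up numbers, the cumulative budget of 150 admits only "jackson" and "michael" to the MUST group, whereas the per-term threshold of 1000 admits all five terms, whose DFs sum to 2750. The sketch leaves out the case described in the issue where every term is high-frequency and the whole query becomes a conjunction.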