Mark, I know you've already committed a patch along these lines (LUCENE-494), and I can see how in a lot of cases that would be a great solution, but I'm still interested in the original idea you proposed (a 'maxDf' in TermQuery), because I anticipate situations in which you don't want to ignore the common term at query time (because you want it to affect the result set), you just don't want to bother spending a lot of time calculating its score contribution since it's so common -- or perhaps, even if an optimization could get the time down, you still wouldn't want its full score contribution included because it's so common.
If I understand your description of the problem, in your profiling you've confirmed that when a term is extremely common, the "tf" portion of the score calculation for each doc is expensive because of the underlying call to TermDocs.read(int[],int[]) ... is that correct?

If that's the case, then it seems like a fairly straightforward and useful patch would be to add the following (untested) to TermQuery...

    private static int maxDocFreq = Integer.MAX_VALUE;
    private static float maxDocFreqRawScore = 0.0f;

    public static void setMaxDocFreqScore(int df, float rawScore) {
      maxDocFreq = df;
      maxDocFreqRawScore = rawScore;
    }

    public Query rewrite(IndexReader reader) throws IOException {
      if (maxDocFreq < reader.docFreq(term)) {
        // should be a ConstantScoreTermQuery, but that doesn't exist
        Query q = new ConstantScoreRangeQuery(term.field(), term.text(),
                                              term.text(), true, true);
        q.setBoost(maxDocFreqRawScore);
        return q.rewrite(reader);
      }
      return this;
    }

...the downside compared to your existing approach is that it still spends some time on the really common terms (building up the filter), so if you truly want to ignore them the analyzer is a better way to go -- but the upside is that it would still allow those really common terms to affect the result set.

thoughts?

: Date: Tue, 07 Feb 2006 20:18:27 +0000
: From: markharw00d <[EMAIL PROTECTED]>
: Reply-To: java-dev@lucene.apache.org
: To: java-dev@lucene.apache.org
: Subject: Re: Preventing "killer" queries
:
: [Answering my own question]
:
: I think a reasonable solution is to have a generic analyzer for use at
: query-time that can wrap my application's choice of analyzer and
: automatically filter out what it sees as stop words. It would initialize
: itself from an IndexReader and create a StopFilter for those terms
: greater than a given document frequency.
:
: This approach seems reasonable because:
: a) The stop word filter is automatically adaptive and doesn't need
: manual tuning.
: b) I can live with the disk space overhead of the few "killer" terms
: which will make it into the index.
: c) "Silent" failure (ie removal of terms from query) is probably
: generally preferable to the throw-an-exception approach taken by
: BooleanQuery if clause limits are exceeded.

-Hoss
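
P.S. For the record, here's a rough (completely untested) sketch of the kind of self-configuring wrapper analyzer I think you're describing -- the class name and constructor args are just made up for illustration. It walks the TermEnum for a field once at construction time, remembers every term whose docFreq exceeds the threshold, and then applies a StopFilter on top of whatever the wrapped analyzer produces...

    import java.io.IOException;
    import java.io.Reader;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class HighFreqStopAnalyzer extends Analyzer {
      private final Analyzer wrapped;
      private final Set stopWords = new HashSet();

      /** Scan the field's terms; treat any with docFreq > maxDf as stop words. */
      public HighFreqStopAnalyzer(Analyzer wrapped, IndexReader reader,
                                  String field, int maxDf) throws IOException {
        this.wrapped = wrapped;
        TermEnum te = reader.terms(new Term(field, ""));
        try {
          while (te.term() != null && te.term().field().equals(field)) {
            if (te.docFreq() > maxDf) {
              stopWords.add(te.term().text());
            }
            if (!te.next()) break;
          }
        } finally {
          te.close();
        }
      }

      public TokenStream tokenStream(String fieldName, Reader text) {
        // silently drop the high frequency terms from the wrapped stream
        return new StopFilter(wrapped.tokenStream(fieldName, text), stopWords);
      }
    }

...so a query-time setup might look something like new QueryParser("body", new HighFreqStopAnalyzer(myAnalyzer, reader, "body", 50000)). Since the StopFilter just drops the terms, you get the "silent" failure mode you describe in (c).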