Thanks for the comments, Chris/Doug.

Chris, although I suggested it initially, I'm now a little uncomfortable controlling this issue with a static variable in TermQuery, because it doesn't let me have different settings for different queries, indexes or fields.

Doug, I'd ideally like to optimize for this condition in advance rather than get into trouble and throw exceptions to blow out queries.
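As a rough illustration of the kind of per-field control a single static variable cannot give, here is a minimal sketch in plain Java. It is not Lucene code: the class name `DfStopWordPolicy`, its methods, and the idea of expressing the limit as a ratio of document frequency to index size are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: decides, per field, whether a term
// is common enough to be treated like a stop word.
class DfStopWordPolicy {
    // Limit applied to any field without its own override (assumed: df / numDocs).
    private final double defaultMaxDfRatio;
    // Per-field overrides of the default limit.
    private final Map<String, Double> perFieldMaxDfRatio = new HashMap<>();

    DfStopWordPolicy(double defaultMaxDfRatio) {
        this.defaultMaxDfRatio = defaultMaxDfRatio;
    }

    // Set a different limit for one field; returns this so calls can be chained.
    DfStopWordPolicy setFieldLimit(String field, double maxDfRatio) {
        perFieldMaxDfRatio.put(field, maxDfRatio);
        return this;
    }

    // True when docFreq / numDocs exceeds the limit in force for this field.
    boolean isTooCommon(String field, int docFreq, int numDocs) {
        if (numDocs <= 0) {
            return false; // empty index: nothing can be "too common"
        }
        double limit = perFieldMaxDfRatio.getOrDefault(field, defaultMaxDfRatio);
        return (double) docFreq / numDocs > limit;
    }
}
```

A separate policy object per index or per query would cover the other two axes. For instance, `new DfStopWordPolicy(0.4).setFieldLimit("doctype", 0.9)` would flag a 50% df term in a free-text field but leave the same df in "doctype" alone.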
I like to think of the ideal solution as a control which automatically identifies and tunes out what it sees as stop words, but is controllable on a per-index, per-field and per-query basis if need be. The analyzer seemed a reasonably flexible way to do this.

I tried looking at the performance of Filter vs Query on a 1 million doc index, as per Chris's suggestion, and found that RangeFilter.bits() does improve on search.search(TermQuery), and that the improvement was a constant factor as df increases. The filter.bits call typically took 60% of the equivalent TermQuery search time across the range of DFs tested. However, both filter and query response times increase linearly with df, so I suspect they are both ultimately heading for trouble as data volumes increase; it is just that TermQuery gets there sooner than the filter.

I'd rather head this problem off sooner by stop-wording very common terms in large indexes using the analyzer. Obviously this wouldn't catch Range/Fuzzy queries, which expand at rewrite time, but at large data volumes you have to manage those types of query carefully anyway.

I did come across a bizarre anomaly I would be interested to have explained. A RangeFilter based on a single term with 50% df responds in the same time as a RangeFilter on a different field for a term with the same df. When it comes to TermQuery, though, not all fields are equal. A TermQuery on a "free text" field with many values, for a single term with 50% df, takes half the time of a TermQuery on a constrained field ("doctype") for a single term with similar df. The doctype field only ever has one of 6 possible values. Both queries run against the same index with similar df values, and the relative performance difference was the same for the other DFs I tested across the 2 fields.

What is going on here? If anything, I might have expected the open-ended field to be slower.

Cheers,
Mark