I've just been doing some benchmarking on a reasonably large-scale system (38 million docs) and ran into an issue where certain *very* common terms dramatically slow query responses. Some terms were abnormally common because I had constructed the index by taking several copies of a small sample and merging them, so address data from the sample area had the county name reproduced massively. Consequently, a TermQuery for the county name (with 50% docFreq) in the scaled-up 38m-doc index took 2 seconds to return, whereas most "normal" terms (<10% df) took a matter of milliseconds.
Of course, the solution for most situations is to use a stop-word list at index time, but that requires some manual configuration and prior knowledge of the data, which isn't always ideal. For these outlier situations, is it worth adding a "maxDf" property to TermQuery, similar to BooleanQuery's maxClauseCount query-time control? I could fix my problem in my own app-specific query construction code, but I wonder if others would find it a useful fix to add to TermQuery in the Lucene core?

Cheers,
Mark
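
For illustration only, the app-specific workaround mentioned above could look roughly like the sketch below. It is not an existing Lucene API: the class name `MaxDfTermFilter`, its methods, and the 0.25 threshold are all hypothetical, and the lookup function is assumed to wrap something like `IndexReader.docFreq(Term)`, with the index size taken from `IndexReader.maxDoc()`. Plain strings stand in for terms so the example has no Lucene dependency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

/** Hypothetical helper (not part of Lucene): skips query terms above a document-frequency cap. */
final class MaxDfTermFilter {
    private final ToIntFunction<String> docFreq; // lookup, e.g. wrapping IndexReader.docFreq(Term)
    private final int maxDoc;                    // index size, e.g. IndexReader.maxDoc()
    private final double maxDfRatio;             // cap as a fraction of maxDoc (assumed value, e.g. 0.25)

    MaxDfTermFilter(ToIntFunction<String> docFreq, int maxDoc, double maxDfRatio) {
        if (maxDoc <= 0) {
            throw new IllegalArgumentException("maxDoc must be positive");
        }
        if (maxDfRatio <= 0.0 || maxDfRatio > 1.0) {
            throw new IllegalArgumentException("maxDfRatio must be in (0, 1]");
        }
        this.docFreq = docFreq;
        this.maxDoc = maxDoc;
        this.maxDfRatio = maxDfRatio;
    }

    /** True when the term occurs in a larger share of documents than the cap allows. */
    boolean isTooCommon(String term) {
        return (double) docFreq.applyAsInt(term) / maxDoc > maxDfRatio;
    }

    /** Returns the terms worth turning into query clauses, preserving their order. */
    List<String> keepSelective(List<String> terms) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (!isTooCommon(term)) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

The calling code would build a TermQuery only for the terms that `keepSelective` returns. What to do when every term is dropped (run the query anyway, or return no results) is left to the application.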