I've just been benchmarking a reasonably large-scale
system (38 million docs) and ran into an issue where
certain *very* common terms dramatically slowed query
responses.
Some terms were abnormally common because I had
constructed the index by taking several copies of a
small sample and merging them. The address data from
this small sample area therefore had the county name
repeated massively.
Consequently a TermQuery for the county name (with 50%
docFreq) in the scaled-up 38m doc index took 2 seconds
to return, whereas most "normal" terms (<10% df) took a
matter of milliseconds.

Of course the solution for most situations is to use a
stop-word list at index time, but that requires some
manual configuration and prior knowledge of the data,
which isn't always ideal.

For these outlier situations, is it worth adding a
"maxDf" property to TermQuery, like BooleanQuery's
maxClauseCount query-time control? I could fix my
problem in my own app-specific query construction code,
but I wonder if others would find it a useful addition
to TermQuery in the Lucene core?
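
To make the app-side version of this check concrete, here
is a minimal sketch in plain Java. MaxDfFilter, its
docFreq lookup and the 0.4 cut-off used in the note below
are hypothetical names and values made up for
illustration; none of this is existing Lucene API. In a
real application I'd expect the lookup to be backed by
IndexReader.docFreq(Term) and the document count by
IndexReader.maxDoc().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

/**
 * Hypothetical helper (not part of Lucene): decides at query-construction
 * time whether a term is too common to be worth searching on.
 */
final class MaxDfFilter {
    // Assumption: returns the number of docs containing the term,
    // e.g. a wrapper around IndexReader.docFreq(Term).
    private final ToIntFunction<String> docFreq;
    // Assumption: total number of docs in the index, e.g. IndexReader.maxDoc().
    private final int maxDoc;
    // Largest allowed fraction of docs a term may appear in, in (0, 1].
    private final double maxDfRatio;

    MaxDfFilter(ToIntFunction<String> docFreq, int maxDoc, double maxDfRatio) {
        if (maxDoc <= 0) {
            throw new IllegalArgumentException("maxDoc must be positive");
        }
        if (maxDfRatio <= 0.0 || maxDfRatio > 1.0) {
            throw new IllegalArgumentException("maxDfRatio must be in (0, 1]");
        }
        this.docFreq = docFreq;
        this.maxDoc = maxDoc;
        this.maxDfRatio = maxDfRatio;
    }

    /** True if the term appears in more than maxDfRatio of all docs. */
    boolean isTooCommon(String term) {
        return docFreq.applyAsInt(term) > maxDfRatio * maxDoc;
    }

    /** Returns only the terms that are rare enough to query on, in order. */
    List<String> keepSearchable(List<String> terms) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (!isTooCommon(term)) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

With a cut-off of 0.4 and 38 million docs, a county name
present in 50% of the docs (19 million) would be dropped
before any query is built for it, while a term in under
10% of the docs would pass through untouched.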


Cheers,
Mark






                