Hi Chris,

Here is an approach based on the quantity of matching terms in an adapted BooleanQuery:
http://issues.apache.org/bugzilla/show_bug.cgi?id=35284

Paul makes an interesting observation at the end which shows how this functionality can be added to the existing BooleanQuery without too much effort. I'd personally like to see this added to BooleanQuery.

As an example application, I currently use this functionality in my custom CoordConstrainedBooleanQuery to prevent "More Like This" queries returning long lists of dissimilar documents, by insisting that 30% of the generated query terms match.

This approach is of course based purely on the quantity of matching terms, not the quality-based measures in your example. As you suggest, quality is a combination of user-derived measures (boosts) and data-derived measures (tf, idf, docBoost). That sounds like a more informed approach in principle, but I'm not currently sure how it would be implemented efficiently in practice.

Here's one possible approach I can think of: I have previously optimized large BooleanQueries generated from nGrams by taking only the top idf-ranked terms, purely to reduce query times. A similar approach could be used to automatically rewrite a BooleanQuery consisting entirely of optional terms into the equivalent of:

  +(my high idf terms) (low idf terms)

Basically this produces a query that MUST match the decent terms and scores extra points for the "optional extras". Query term boosts could be factored into the decision for selecting the "must have" terms and the "nice to haves". This would help maintain a minimum level of relevance when relevance isn't the primary sort field.

Cheers,
Mark
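
For illustration only, the idf-based split described above could be sketched roughly as follows. This is plain Java, not Lucene API code: the class IdfSplitSketch, its method names, the docFreqs map and the mustCount cut-off are all hypothetical, and the idf formula is a common form assumed for the example, not something taken from the patch. It only builds the query string; a real rewrite would construct BooleanClauses instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits all-optional terms into a required group of
// high-idf terms and an optional group of low-idf terms.
final class IdfSplitSketch {

    // Assumed idf form: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Returns a query string of the form "+(high idf terms) (low idf terms)".
    // mustCount (an assumption) is how many top-idf terms go in the required group.
    static String rewrite(Map<String, Integer> docFreqs, int numDocs, int mustCount) {
        List<String> terms = new ArrayList<>(docFreqs.keySet());
        // Sort by idf, highest first (rarest terms lead).
        terms.sort(Comparator.comparingDouble(
                (String t) -> -idf(docFreqs.get(t), numDocs)));

        int cut = Math.max(0, Math.min(mustCount, terms.size()));
        String must = String.join(" ", terms.subList(0, cut));
        String optional = String.join(" ", terms.subList(cut, terms.size()));

        if (must.isEmpty()) {
            return optional; // nothing promoted: query stays all-optional
        }
        return optional.isEmpty()
                ? "+(" + must + ")"
                : "+(" + must + ") (" + optional + ")";
    }
}
```

Query term boosts could be folded in by ranking on idf multiplied by the boost instead of idf alone; that variation is likewise an assumption, not a tested design.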