On Friday, November 14, 2003, at 02:02 PM, Chong, Herb wrote:
if i just run this query against a million document newswire index, i know i am going to get lots of hits. the phrase "capital gains tax" hits a lot fewer documents, but is overrestrictive. the fact that the three terms occur next to each other in the query means that documents with the three terms far apart should not get nearly as much weight in the ranking scheme. a sentence ending with two terms "capital gains" followed by a sentence starting with the term "tax" should not be a highly ranked match. that means you need sentence boundaries in the index. the indexing and the query analysis scheme has to understand the linguistic concept of a phrase, and phrases do not cross sentence boundaries.

With Lucene's analysis process, you can assign a position increment to each token. The default value is 1, meaning the token occupies the next position. Phrase queries default to a slop of 0, meaning the terms must be in successive positions. When you encounter a sentence boundary during analysis, you could set the position increment of the next word (the first word of the next sentence) to a high number, large enough to exceed any slop users might search with (or just something greater than one if you never use sloppy phrase searches).
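
To make the idea concrete, here is a toy model in Python. It is not Lucene code and uses none of Lucene's API: the names assign_positions, phrase_match and SENTENCE_GAP are made up for illustration, the sentence splitter is a naive regex, and the slop check is a simplification (terms in order, total extra gap at most slop) rather than Lucene's actual sloppy-phrase matching. It only shows why a large increment at a sentence boundary keeps a phrase from matching across two sentences.

```python
import re

# Hypothetical gap; pick something larger than any slop you expect users to use.
SENTENCE_GAP = 100


def assign_positions(text, sentence_gap=SENTENCE_GAP):
    """Return (term, position) pairs. Each token advances the position by 1,
    except the first token after a sentence boundary, which advances it by
    sentence_gap."""
    positions = []
    pos = -1
    # Naive sentence split on ., ! and ? (an assumption, not real sentence detection).
    for sentence in re.split(r"[.!?]+", text):
        terms = re.findall(r"\w+", sentence.lower())
        for j, term in enumerate(terms):
            # A boundary only exists if some token was already emitted.
            increment = sentence_gap if (j == 0 and positions) else 1
            pos += increment
            positions.append((term, pos))
    return positions


def phrase_match(positions, phrase, slop=0):
    """Simplified phrase check: terms must appear in order, and the total
    number of skipped positions between them must not exceed slop."""
    terms = phrase.lower().split()
    index = {}
    for term, pos in positions:
        index.setdefault(term, []).append(pos)

    def search(i, prev, budget):
        if i == len(terms):
            return True
        for p in index.get(terms[i], []):
            if i == 0:
                if search(1, p, budget):
                    return True
            else:
                gap = p - prev - 1  # positions skipped between the two terms
                if 0 <= gap <= budget and search(i + 1, p, budget - gap):
                    return True
        return False

    return bool(terms) and search(0, None, slop)


doc = "He paid tax on capital gains. Tax rates rose sharply."
with_gap = assign_positions(doc)                     # boundary-aware positions
without_gap = assign_positions(doc, sentence_gap=1)  # default increments only

print(phrase_match(without_gap, "capital gains tax"))        # matches across the boundary
print(phrase_match(with_gap, "capital gains tax"))           # blocked by the gap
print(phrase_match(with_gap, "capital gains tax", slop=50))  # still blocked: gap exceeds slop
```

With default increments the phrase matches even though "tax" starts a new sentence; with the gap it does not, at slop 0 or at any slop smaller than the gap.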


Does this get you closer to what you're after?

As for how to weight queries by the distance between terms... I'll have to think on that some, but I suspect something reasonable could be done with a custom Similarity or a custom type of Query.

Erik

