since i am working now on financial news, here is an example:
capital gains tax
if i just run this query against a million-document newswire index, i know i am going to get lots of hits. the phrase "capital gains tax" hits a lot fewer documents, but is over-restrictive. the fact that the three terms occur next to each other in the query means that documents with the three terms far apart should not get nearly as much weight in the ranking scheme. a sentence ending with the two terms "capital gains" followed by a sentence starting with the term "tax" should not be a highly ranked match. that means you need sentence boundaries in the index. the indexing and query analysis schemes have to understand the linguistic concept of a phrase, and phrases do not cross sentence boundaries.
Have sentence boundaries actually proven to be that useful in this sort of thing? For example, if the text were something like:
"Such sales would be considered long term capital gains. Tax on these is 20%."
Then penalizing for the sentence boundary wouldn't be valid.
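To make the trade-off concrete, here is a rough Python sketch of proximity scoring with an optional sentence-boundary penalty. Everything in it is made up for illustration: tokenize, best_window, proximity_score and the boundary_penalty parameter are hypothetical names, not anything from an existing search library, and the sentence splitter is deliberately naive (it would break on decimals such as "20.5%").

```python
import re


def tokenize(text):
    """Return (term, sentence_index) pairs using a naive split on . ! ?"""
    tokens = []
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    for sent_idx, sentence in enumerate(sentences):
        for term in re.findall(r"[a-z0-9%]+", sentence.lower()):
            tokens.append((term, sent_idx))
    return tokens


def best_window(tokens, query_terms):
    """Find the shortest token window containing every query term.

    Returns (window_length, sentences_spanned), or None if some term
    is missing. Brute force, which is fine for a short illustration.
    """
    wanted = set(query_terms)
    best = None
    for start in range(len(tokens)):
        seen = set()
        for end in range(start, len(tokens)):
            if tokens[end][0] in wanted:
                seen.add(tokens[end][0])
            if seen == wanted:
                length = end - start + 1
                spanned = tokens[end][1] - tokens[start][1] + 1
                if best is None or length < best[0]:
                    best = (length, spanned)
                break
    return best


def proximity_score(text, query, boundary_penalty=1.0):
    """Score is 1.0 when the terms are adjacent and falls as they spread out.

    boundary_penalty is an assumed knob: the score is multiplied by it
    once for each sentence boundary inside the best window.
    """
    terms = query.lower().split()
    window = best_window(tokenize(text), terms)
    if window is None:
        return 0.0
    length, spanned = window
    return (len(terms) / length) * boundary_penalty ** (spanned - 1)


text = "Such sales would be considered long term capital gains. Tax on these is 20%."
print(proximity_score(text, "capital gains tax"))                        # no penalty
print(proximity_score(text, "capital gains tax", boundary_penalty=0.5))  # penalized
```

With boundary_penalty left at 1.0 the example text scores as a perfect adjacent match; with 0.5 the same, arguably relevant, text loses half its score purely because of the period between "gains" and "Tax".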
Doug
