since i am working now on financial news, here is an example: capital gains tax
if i just run this query against a million-document newswire index, i know i am going to get lots of hits. the phrase "capital gains tax" hits far fewer documents, but is overly restrictive.

the fact that the three terms occur next to each other in the query means that documents with the three terms far apart should not get nearly as much weight in the ranking scheme. a sentence ending with the two terms "capital gains" followed by a sentence starting with the term "tax" should not be a highly ranked match. that means you need sentence boundaries in the index. the indexing and query analysis schemes have to understand the linguistic concept of a phrase, and phrases do not cross sentence boundaries.

Herb....

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 1:52 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

You mean if you have text like this: "Hello Herb. Have a nice day!", you want to prevent phrase queries for "herb have"? You could prevent sentence boundary crossing with clever use of the token position I suspect. Would that accomplish what you're after?

Could you give a really dumbed down simple example of what you mean by inter-term correlation?

    Erik
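The two ideas in this exchange (weight by how close the query terms sit, and use token positions to keep matches from crossing a sentence boundary) can be sketched together in a few lines. This is only an illustration in plain Python, not Lucene code: the names `index_positions`, `proximity_score` and `SENTENCE_GAP`, the gap size of 100, and the 1/(1 + slop) decay are all assumptions made for the example. The sentence splitter is deliberately naive and would break on abbreviations or decimals such as "3.5".

```python
import re
from itertools import product

# Assumed size of the position jump inserted at each sentence boundary.
SENTENCE_GAP = 100


def index_positions(text, gap=SENTENCE_GAP):
    """Map each lowercased term to its token positions.

    Positions increase by 1 inside a sentence and jump by `gap`
    at every sentence boundary (naive split on ., ! and ?).
    """
    positions = {}
    pos = 0
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"\w+", sentence.lower())
        if not tokens:
            continue
        for tok in tokens:
            positions.setdefault(tok, []).append(pos)
            pos += 1
        pos += gap  # push the next sentence far away
    return positions


def proximity_score(query, positions):
    """Return 1.0 when the terms are adjacent, less as they spread out.

    The score is 1 / (1 + slop), where slop is the number of extra
    positions in the smallest window covering one occurrence of every
    query term. Returns 0.0 if any term is missing.
    Brute force over occurrences: fine for a sketch, not for an index.
    """
    terms = query.lower().split()
    lists = [positions.get(t) for t in terms]
    if not terms or any(not lst for lst in lists):
        return 0.0
    best_span = min(max(c) - min(c) for c in product(*lists))
    slop = max(0, best_span - (len(terms) - 1))
    return 1.0 / (1.0 + slop)


if __name__ == "__main__":
    docs = [
        "Capital gains tax rose.",
        "The tax on capital was cut and gains fell.",
        "He reported capital gains. Tax season begins.",
    ]
    for doc in docs:
        score = proximity_score("capital gains tax", index_positions(doc))
        print(f"{score:.4f}  {doc}")
```

With the gap in place, the document where "capital gains" ends one sentence and "tax" starts the next still matches, but it ranks below both the exact phrase and the same-sentence loose match, which is the behavior asked for above.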
