RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

Chong, Herb Fri, 14 Nov 2003 12:06:43 -0800

since i am working now on financial news, here is an example:

capital gains tax


if i just run this query against a million-document newswire index, i know i am going 
to get lots of hits. the phrase "capital gains tax" hits a lot fewer documents, but is 
overly restrictive. the fact that the three terms occur next to each other in the query 
means that documents with the three terms far apart should not get nearly as much 
weight in the ranking scheme. a sentence ending with the two terms "capital gains" 
followed by a sentence starting with the term "tax" should not be a highly ranked 
match. that means you need sentence boundaries in the index. the indexing and 
query analysis schemes have to understand the linguistic concept of a phrase, and 
phrases do not cross sentence boundaries.
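
a minimal sketch of that ranking idea in plain Python, not Lucene code. everything in it is an assumption made for illustration: the naive punctuation-based sentence splitter, the helper names (split_sentences, tokenize, min_window, proximity_score), and the scoring formula, which divides the number of distinct query terms by the smallest window, inside one sentence, that contains all of them.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    # Lowercase alphanumeric tokens; a stand-in for a real analyzer.
    return re.findall(r"[a-z0-9]+", text.lower())


def min_window(tokens, terms):
    """Length of the smallest run of tokens containing every term, or None."""
    needed = set(terms)
    last_seen = {}
    best = None
    for i, tok in enumerate(tokens):
        if tok in needed:
            last_seen[tok] = i
            if len(last_seen) == len(needed):
                # Shortest window ending at i starts at the oldest "last seen" term.
                span = i - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best


def proximity_score(text, query):
    """1.0 when the terms are adjacent, lower as they spread out,
    0.0 when no single sentence contains all of them."""
    terms = tokenize(query)
    if not terms:
        return 0.0
    best = 0.0
    for sentence in split_sentences(text):
        span = min_window(tokenize(sentence), terms)
        if span is not None:
            best = max(best, len(set(terms)) / span)
    return best


if __name__ == "__main__":
    q = "capital gains tax"
    for doc in (
        "The capital gains tax was cut.",
        "Capital flowed in, and the tax on those gains rose.",
        "He spent the capital gains. Tax season came.",
    ):
        print(round(proximity_score(doc, q), 3), doc)
```

under these assumptions the adjacent occurrence scores highest, the spread-out occurrence scores lower, and the occurrence split across a sentence boundary gets no proximity credit at all. the window test ignores term order, so it is looser than a strict phrase match.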

Herb....

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 1:52 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

You mean if you have text like this: "Hello Herb.  Have a nice day!", 
you want to prevent phrase queries for "herb have"?  You could prevent 
sentence boundary crossing with clever use of the token position I 
suspect.  Would that accomplish what you're after?
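
A minimal sketch of the token-position idea, in plain Python rather than Lucene code. It assumes a naive punctuation-based sentence splitter and a hypothetical gap size (SENTENCE_GAP); the helper names are illustrative only. Each token gets the next position, except that the first token of every later sentence jumps ahead by the gap, so an exact phrase match on consecutive positions cannot straddle a sentence boundary.

```python
import re

SENTENCE_GAP = 100  # hypothetical gap; any value larger than the allowed slop works


def positions(text, gap=SENTENCE_GAP):
    """Return (token, position) pairs with a position jump between sentences."""
    out = []
    pos = -1
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for n, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        for k, tok in enumerate(tokens):
            starts_later_sentence = (k == 0 and n > 0)
            pos += (gap + 1) if starts_later_sentence else 1
            out.append((tok, pos))
    return out


def phrase_match(text, phrase):
    """True if the phrase words occur at consecutive positions in the text."""
    index = {}
    for tok, pos in positions(text):
        index.setdefault(tok, set()).add(pos)
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words:
        return False
    return any(
        all(start + i in index.get(w, ()) for i, w in enumerate(words))
        for start in index.get(words[0], ())
    )


if __name__ == "__main__":
    text = "Hello Herb.  Have a nice day!"
    print(positions(text))
    print(phrase_match(text, "herb have"))   # crosses the sentence boundary
    print(phrase_match(text, "nice day"))    # stays inside one sentence
```

With the gap in place, "herb have" no longer matches because "herb" and "have" are not at adjacent positions, while "nice day" still does.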

Could you give a really dumbed down simple example of what you mean by 
inter-term correlation?

        Erik
