RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

Chong, Herb Fri, 14 Nov 2003 12:33:18 -0800

something in the index needs to mark sentence boundaries so that words close together 
in the query and found close together in the text get penalized for being separated by 
a boundary. proximity isn't enough because in more complex queries, some of the 
intermediate words might be omitted. in other words, A near B near C works only when 
A, B, and C occur close together and fails when B is omitted, but in English, A or B 
sometimes do get omitted after the first mention of the phrase A B C.


as far as the ranking calculation goes, the distance itself isn't a direct factor. put 
it this way, imagine you have a sliding window on the query of some number of terms. 
suppose you have a 5 term query. an exact match of the 5 terms on the window ought to 
match higher than 4 out of 5. similarly, matching 4 terms out of 5 with the 5th term 
nearby but in another sentence should not rank as high. 5, BTW, is the magic number 
for English and other languages that use the same rules for multiword term composition.

read this - http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf. read the section 
on Lexical Affinities. also find this paper: Y. Maarek and F. Smadja. Full text 
indexing based on lexical relations: An application: Software libraries. In 
Proceedings of the 12th International ACM SIGIR Conference on Research and Development 
in Information Retrieval, pages 198--206, 1989.

Herb....
-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 2:10 PM
To: Lucene Users List
Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?]


With Lucene's analysis process, you can assign a position increment to 
tokens.  The default value is 1, meaning its the next position.  Phrase 
queries default to a slop of 0, meaning they must be in successive 
positions.  When analyzing and you encounter a sentence boundary, you 
could set the position increment of the next word (the first word of 
the next sentence) to a high number (to account for users searching 
with potential slop, or just something greater than one if you never 
use sloppy phrase searches).

Does this get you closer to what you're after?

As for how to weight queries by the distance from terms.... I'll have 
to think on that some, but I suspect something reasonable could be done 
with a custom Similarity or a custom type of Query.

        Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: inter-term correlation [was Re: Vector Space Model in Lucene?]

Reply via email to