something in the index needs to mark sentence boundaries so that words close together in the query and found close together in the text get penalized for being separated by a boundary. proximity isn't enough because in more complex queries, some of the intermediate words might be omitted. in other words, A near B near C works only when A, B, and C occur close together and fails when B is omitted, but in English, A or B sometimes do get omitted after the first mention of the phrase A B C.
as far as the ranking calculation goes, the distance itself isn't a direct factor. put it this way, imagine you have a sliding window on the query of some number of terms. suppose you have a 5 term query. an exact match of the 5 terms on the window ought to match higher than 4 out of 5. similarly, matching 4 terms out of 5 with the 5th term nearby but in another sentence should not rank as high. 5, BTW, is the magic number for English and other languages that use the same rules for multiword term composition. read this - http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf. read the section on Lexical Affinities. also find this paper: Y. Maarek and F. Smadja. Full text indexing based on lexical relations: An application: Software libraries. In Proceedings of the 12th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 198--206, 1989. Herb.... -----Original Message----- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Friday, November 14, 2003 2:10 PM To: Lucene Users List Subject: Re: inter-term correlation [was Re: Vector Space Model in Lucene?] With Lucene's analysis process, you can assign a position increment to tokens. The default value is 1, meaning its the next position. Phrase queries default to a slop of 0, meaning they must be in successive positions. When analyzing and you encounter a sentence boundary, you could set the position increment of the next word (the first word of the next sentence) to a high number (to account for users searching with potential slop, or just something greater than one if you never use sloppy phrase searches). Does this get you closer to what you're after? As for how to weight queries by the distance from terms.... I'll have to think on that some, but I suspect something reasonable could be done with a custom Similarity or a custom type of Query. Erik --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
