Now you're talking. This is one way of doing it. You need to work out a heuristic that increments the counter by enough that a misrecognized long sentence won't trigger this. However, one could argue that a sentence containing 1000 words can't possibly be about one topic.
Herb....

-----Original Message-----
From: Karsten Konrad [mailto:[EMAIL PROTECTED]
Sent: Saturday, November 15, 2003 7:16 AM
To: Lucene Users List
Subject: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

Anyway, Herb is right: sentence boundaries do carry meaning, and the linguistic rule could be phrased as: "Constituents (concepts) mentioned together in one sentence have a closer relation than those that are not."

I was wondering whether we could make use of this while indexing by increasing the position counter by a large number, let's say 1000, whenever we encounter a sentence separator. (Note that this is not trivial; not every '.' ends a sentence, etc.)

Thus, searching for

    "income tax"~100 "tax gain"~100 "income tax gain"~100 income tax gain

would find "income tax gain" as usual, but would boost all texts where the phrases involved appear within sentence boundaries. I assume that a sentence with 100 words would be pretty unlikely, and 100 is still well within the 1000-position separation created by increasing the position counter.

No linguistics necessary, actually, but it is an application of a linguistic rule!
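To make the idea concrete, here is a rough sketch in plain Python. It is not Lucene code: `index_positions`, `within_window`, and `SENTENCE_GAP` are hypothetical names made up for illustration, the sentence splitter is deliberately naive (it will break on abbreviations, as noted above), and the window check only approximates what a sloppy phrase query does.

```python
import re
from itertools import product

SENTENCE_GAP = 1000  # the value floated in this thread; an assumption, tune per corpus


def index_positions(text, gap=SENTENCE_GAP):
    """Return (token, position) pairs, jumping by `gap` at each sentence break."""
    # Naive splitter: '.', '!' or '?' followed by whitespace ends a sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    postings = []
    position = -1
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        for i, token in enumerate(tokens):
            # The first token of every sentence after the first gets the big increment.
            starts_new_sentence = i == 0 and bool(postings)
            position += gap if starts_new_sentence else 1
            postings.append((token, position))
    return postings


def within_window(postings, terms, window):
    """True if every term occurs and some choice of occurrences spans <= window.

    Simplified proximity test (brute force), not Lucene's exact slop semantics.
    """
    positions = {t: [p for tok, p in postings if tok == t] for t in terms}
    if any(not ps for ps in positions.values()):
        return False
    return any(max(c) - min(c) <= window for c in product(*positions.values()))


same = index_positions("Income tax gain was reported.")
split = index_positions("He paid income tax. The gain was small.")
terms = ["income", "tax", "gain"]
print(within_window(same, terms, 100))   # terms share a sentence
print(within_window(split, terms, 100))  # terms straddle a sentence boundary
```

With a gap of 1000 and a window of 100, terms that straddle a sentence boundary can never satisfy the proximity check, while terms inside one (reasonably short) sentence still can. If the splitter misses a boundary, two sentences are simply treated as one, which is the failure mode the heuristic has to tolerate.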
