Now you're talking. This is one way of doing it. You need to work out a heuristic to 
increment the counter enough that a misrecognized long sentence won't trigger this. 
However, one can argue that a sentence that contains 1000 words can't possibly be 
about one topic.

Herb....

-----Original Message-----
From: Karsten Konrad [mailto:[EMAIL PROTECTED]
Sent: Saturday, November 15, 2003 7:16 AM
To: Lucene Users List
Subject: AW: inter-term correlation [was Re: Vector Space Model in
Lucene?]

Anyway, Herb is right: sentence boundaries do carry meaning, and the 
linguistic rule could be phrased as: "Constituents (Concepts) mentioned 
in one sentence together have a closer relation than those that are not."

I was wondering whether we could, while indexing, make use of this by 
increasing the position counter by a large number, let's say 1000, 
whenever we encounter a sentence separator. (Note that this is not 
trivial: not every '.' ends a sentence, etc.) Thus, searching for

"income tax"~100 "tax gain"~100 "income tax gain"~100 income tax gain

would find "income tax gain" as usual, but would boost all texts
where the phrases involved appear within sentence boundaries. I 
assume that a sentence with 100 words would be pretty unlikely, and
a slop of 100 is still well within the 1000-position separation 
created at each sentence boundary, so a phrase match cannot span two
sentences. No linguistics necessary, actually, but it is an application
of a linguistic rule!
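
To make the idea concrete, here is a rough sketch in plain Python. It is
not Lucene code and uses no Lucene API. The names assign_positions and
within_slop are made up for illustration, the sentence detection is the
naive "every '.', '!' or '?' ends a sentence" rule that the note above
warns about, and the slop check is a simplified, in-order-only stand-in
for a sloppy phrase query.

```python
import re

# Gap suggested in the message; an assumption, any value larger than
# the biggest slop used in queries should behave the same way.
SENTENCE_GAP = 1000

SENTENCE_END = {".", "!", "?"}


def assign_positions(text, gap=SENTENCE_GAP):
    """Return (token, position) pairs.

    Positions normally increase by 1 per token. The first token after a
    sentence separator gets an extra `gap` added to its position.
    """
    positions = []
    pos = -1
    pending_gap = 0
    for match in re.finditer(r"[A-Za-z0-9']+|[.!?]", text):
        tok = match.group()
        if tok in SENTENCE_END:
            # Naive boundary detection: abbreviations such as "etc."
            # would wrongly count as sentence ends here.
            pending_gap = gap
            continue
        pos += 1 + pending_gap
        pending_gap = 0
        positions.append((tok.lower(), pos))
    return positions


def within_slop(positions, first, second, slop):
    """True if `second` follows `first` with at most `slop` positions
    between them (simplified: in-order matches only)."""
    firsts = [p for t, p in positions if t == first]
    seconds = [p for t, p in positions if t == second]
    return any(0 <= s - f - 1 <= slop for f in firsts for s in seconds)


if __name__ == "__main__":
    text = "Income tax rose. The gain was small."
    with_gap = assign_positions(text)
    no_gap = assign_positions(text, gap=0)
    # Same sentence: matches either way.
    print(within_slop(with_gap, "income", "tax", 100))
    # Across the sentence boundary: blocked only when the gap is applied.
    print(within_slop(with_gap, "tax", "gain", 100))
    print(within_slop(no_gap, "tax", "gain", 100))
```

With the gap applied, "tax" and "gain" end up more than 1000 positions
apart, so a slop of 100 cannot bridge the boundary. With gap=0 the same
pair matches. As long as the gap is larger than the largest slop used in
queries, a phrase match cannot cross a detected sentence boundary.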
