Ah, I see. You have an absoulte interpretation. I am more relative. I think we're talking about a heuristic, not a law.

Matches within a sentence are scored higher than those that are not. And the closer matching the terms are, whether within the same sentence or not, the greater the score. Given these two principals, at some point, as sentences get longer, a close match across sentence boundaries should probably score substantially higher than a very distant match within a sentence. Thus missing some distant yet still within-sentence matches in very long sentences probably won't substantially alter the ranking. Is 100 long enough? Perhaps not. But 1000 is certainly plenty long.

Doug

Chong, Herb wrote:
any arbitrary number you pick will be broken by some document someone puts into the system.

Herb....

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, November 17, 2003 2:56 PM
To: Lucene Users List
Subject: Re: AW: inter-term correlation [was Re: Vector Space Model in Lucene?]

This is exactly the sort of approach I was advocating in earlier messages. (Although I think you'd only need to increase the position counter by 101 for the first word in each sentence.) Herb Chong didn't seem to think this was appropriate, but I never understood why.

Doug


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to