Re: Proximity in ranking, summary generation

Erik Hatcher Wed, 01 Dec 2004 10:34:11 -0800

On Dec 1, 2004, at 11:54 AM, Venkatraju wrote:

> This is actually 2 somewhat related questions:
> - In regular multi term queries, does the default ranking function of
> Lucene take into account proximity of the search terms? As far as I
> know, proximity data is used only in phrase searches. Is this correct?


Correct.

> If so, does someone have pointers/sample implementation of how
> proximity data can be used to supplement tfidf in ranking documents?


Have a look at PhraseQuery and how it does its ranking.
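
As a rough illustration of the idea (not Lucene code), here is a minimal sketch that adds a proximity bonus on top of an existing tf-idf score. The class ProximityBoost, its method names, and the weight parameter are all made up for this example, and it assumes you already have the term positions of two query terms for a document. The 1/(distance + 1) decay has the same shape as the factor Lucene's default Similarity applies to sloppy phrase matches, as far as I know, but check the source before relying on that.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: supplements a base (e.g. tf-idf)
// score with a bonus that grows as two query terms occur closer together.
public class ProximityBoost {

    /** Smallest absolute gap between any position of term A and any position of term B. */
    static int minDistance(int[] positionsA, int[] positionsB) {
        int[] a = positionsA.clone();
        int[] b = positionsB.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int i = 0, j = 0, best = Integer.MAX_VALUE;
        // Two-pointer sweep over the sorted position lists.
        while (i < a.length && j < b.length) {
            best = Math.min(best, Math.abs(a[i] - b[j]));
            if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return best;
    }

    /** Decay factor: 1.0 for distance 0, falling off as the distance grows. */
    static float proximityFactor(int distance) {
        return 1.0f / (distance + 1);
    }

    /** Base score plus a weighted proximity bonus (the weight is an assumed tuning knob). */
    static float boostedScore(float baseScore, int[] posA, int[] posB, float weight) {
        if (posA.length == 0 || posB.length == 0) {
            return baseScore; // one of the terms is absent: no proximity bonus
        }
        // Adjacent terms (gap of 1) count as distance 0, like an exact phrase.
        int distance = Math.max(0, minDistance(posA, posB) - 1);
        return baseScore + weight * proximityFactor(distance);
    }
}
```

For example, with positions {3, 40} and {10, 42} the closest pair is 40 and 42, so the gap is 2, the distance is 1, and the bonus is half the weight.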

> - Is there a way to get the term offset or byte offset of the best
> match(es) in the document? I am looking to use this information for
> summary generation/highlighting.

Term positions (built up from each token's position increment) are stored in the index. For highlighting, however, you need the start and end offsets of each term in the original text. The CVS version of Lucene includes an option to store these offsets in the index as well. In earlier versions, the text to be highlighted has to be re-analyzed: the Token objects produced by the analyzer carry the start and end offsets that the Highlighter package uses.

Have a look at the Highlighter code in the Lucene Sandbox.
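
To make the re-analysis approach concrete, here is a small self-contained sketch. It is not the Sandbox Highlighter: the classes OffsetHighlighter and Tok and the letters-and-digits tokenizer are stand-ins invented for this example, playing the role of an Analyzer and its Token objects. It only shows how start/end offsets recorded during analysis let you mark up the original text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not the Lucene Highlighter: re-analyzes the text,
// keeps each token's offsets, and wraps matching terms in <B> tags.
public class OffsetHighlighter {

    /** Stand-in for a Lucene Token: the term plus its offsets in the original text. */
    static final class Tok {
        final String term;
        final int start; // offset of the first character (inclusive)
        final int end;   // offset just past the last character (exclusive)

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Toy analyzer: splits on anything that is not a letter or digit and lowercases terms. */
    static List<Tok> analyze(String text) {
        List<Tok> toks = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separators
            }
            int start = i;
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                i++; // consume one token
            }
            if (i > start) {
                toks.add(new Tok(text.substring(start, i).toLowerCase(Locale.ROOT), start, i));
            }
        }
        return toks;
    }

    /** Copies the text through unchanged, wrapping every token found in queryTerms. */
    static String highlight(String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        int last = 0; // end offset of the last highlighted token
        for (Tok t : analyze(text)) {
            if (queryTerms.contains(t.term)) {
                out.append(text, last, t.start)
                   .append("<B>").append(text, t.start, t.end).append("</B>");
                last = t.end;
            }
        }
        return out.append(text.substring(last)).toString();
    }
}
```

The query terms must be lowercased the same way the toy analyzer lowercases tokens. With offsets stored in the index, the analyze step could be skipped and the offsets read back instead.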

        Erik


