On Dec 1, 2004, at 11:54 AM, Venkatraju wrote:
This is actually 2 somewhat related questions: - In regular multi term queries, does the default ranking function of Lucene take into account proximity of the search terms? As far as I know, proximity data is used only in phrase searches. Is this correct?
Correct.
If so, does someone have pointers/sample implementation of how proximity data can be used to supplement tfidf in ranking documents?
Have a look at PhraseQuery and how it does its ranking.
- Is there a way to get the term offset or byte offset of the best match(es) in the document? I am looking to use this information for summary generation/highlighting.
Term offset (aka position increments) is stored in the index. However, for highlighting you need byte offsets. The CVS version of Lucene includes an option to capture the byte offsets in the index also. Prior to this CVS version, it required re-analyzing the text to be highlighted. Token objects contain the byte offsets that the Highlighter package uses.
Have a look at the Highlighter code in the Lucene Sandbox.
Erik
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
