I have seen different versions of Lucene's ranking function in the Similarity documentation and on the Lucene user list.
I need to compute document-document similarities, so I issue the document text directly as the query. I found that issuing "computer computer" to Lucene gives a different result than issuing it to a standard VSM. The standard VSM treats "computer computer" the same as "computer", but Lucene doesn't.

To state my question more clearly, I wrote a more formal document so that there is no ambiguity in the formulas: http://www.cs.virginia.edu/~xj3a/lucene_ranking.pdf

I am not sure whether I understand this correctly, but the main cause seems to be Lucene's query parser. It assumes by default that each term appears only once, so if a term is repeated in the query string, the results are unexpected. For details, please refer to the linked document.

Thanks,
Xiangyu Jin
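P.S. To make the difference concrete, here is a minimal sketch in Python. It is not Lucene's actual scoring formula. The function names `cosine` and `per_clause_sum` are hypothetical, and the second function is only a toy model that assumes each repeated query term is scored as a separate clause with no query-side normalization. Both use raw term frequencies and no idf.

```python
import math
from collections import Counter


def cosine(query_tokens, doc_tokens):
    """Standard VSM cosine similarity on raw term-frequency vectors."""
    q = Counter(query_tokens)
    d = Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q)
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    d_norm = math.sqrt(sum(v * v for v in d.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    # Dividing by the query norm cancels any repetition of a single term.
    return dot / (q_norm * d_norm)


def per_clause_sum(query_tokens, doc_tokens):
    """Toy model (an assumption, not Lucene's formula): every query token
    is its own clause, and clause scores are summed without normalizing
    by the query length."""
    d = Counter(doc_tokens)
    d_norm = math.sqrt(sum(v * v for v in d.values()))
    if d_norm == 0:
        return 0.0
    # A repeated token contributes once per occurrence in the query.
    return sum(d[t] for t in query_tokens) / d_norm


if __name__ == "__main__":
    doc = ["computer", "science", "computer"]
    for query in (["computer"], ["computer", "computer"]):
        print(query, cosine(query, doc), per_clause_sum(query, doc))
```

In this sketch the cosine score is the same for "computer" and "computer computer", while the per-clause sum doubles. That is the kind of discrepancy I am asking about.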
