Dear Lucene Users,

I'd like to use Lucene to find scientific papers in the index that are similar to a given paper from the index. This seems to be possible using the MoreLikeThis feature or by wrapping the given document in a query composed of several other queries (BooleanQuery). In both cases the similarity is calculated according to Lucene's Practical Scoring Function, defined in the JavaDoc of class Similarity.

What I am trying to do is to calculate the "semantic document similarity". One example similarity function for that purpose is given on page two of the paper "Corpus-based and Knowledge-based Measures of Text Semantic Similarity" by Rada Mihalcea et al. (formula 1). Instead of using the TF and IDF values, it uses IDF values and the relatedness between the unique words of the two documents being compared. First, for each unique word in document 1, it takes the relatedness to its most related word in document 2, multiplies it by the word's IDF value, and sums the results. The same procedure is done for document 2 with respect to document 1.
After that, the two sums are averaged.
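
To make the function concrete, here is a minimal sketch in plain Java, without any Lucene classes. All names in it (SemanticSimilaritySketch, directed, maxRelatedness, the idf map and the relatedness callback) are made up for illustration. It assumes the word-word relatedness values and the IDF values are already available, and it assumes, as I read formula 1, that each sum is divided by the sum of the IDF values of that document's words before the two results are averaged.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

public class SemanticSimilaritySketch {

    // Highest relatedness between `word` and any word of the other document
    // (0.0 if the other document has no words).
    static double maxRelatedness(String word, Set<String> other,
                                 ToDoubleBiFunction<String, String> relatedness) {
        double best = 0.0;
        for (String candidate : other) {
            best = Math.max(best, relatedness.applyAsDouble(word, candidate));
        }
        return best;
    }

    // One direction of the measure: for every unique word in `from`, take its
    // best match in `to`, weight it by the word's IDF and sum up.
    // Assumption: the sum is normalized by the total IDF weight of `from`.
    static double directed(Set<String> from, Set<String> to,
                           Map<String, Double> idf,
                           ToDoubleBiFunction<String, String> relatedness) {
        double weightedSum = 0.0;
        double idfSum = 0.0;
        for (String word : from) {
            double weight = idf.getOrDefault(word, 0.0); // unknown words get weight 0
            weightedSum += maxRelatedness(word, to, relatedness) * weight;
            idfSum += weight;
        }
        return idfSum == 0.0 ? 0.0 : weightedSum / idfSum;
    }

    // Symmetric similarity: the average of both directions.
    static double similarity(Set<String> doc1, Set<String> doc2,
                             Map<String, Double> idf,
                             ToDoubleBiFunction<String, String> relatedness) {
        return 0.5 * (directed(doc1, doc2, idf, relatedness)
                    + directed(doc2, doc1, idf, relatedness));
    }
}
```

Each call compares every unique word of one document with every unique word of the other, so the cost per document pair grows with the product of the two vocabulary sizes. That is the effort I am asking about below.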

My question is: Given that I am able to store WordNet words extracted from the documents in the index and to pre-calculate the word-word similarities, is it possible, and does it make sense (e.g. from the point of view of computational effort), to adapt the Practical Scoring Function to such a function of semantic document similarity? And where (in which class) is the Practical Scoring Function implemented, i.e. where are the values of TF, IDF, Boost... put together?

Regards,
Mathias Silbermann
