adapting lucene's practical scoring function

Mathias Silbermann Thu, 25 Mar 2010 12:07:29 -0700

Dear Lucene Users,

I'd like to use Lucene to find scientific papers in the index that are similar to a given paper from the index. This seems to be possible using the MoreLikeThis feature or by wrapping the given document in a query composed of several other queries (BooleanQuery). The similarity is calculated according to Lucene's Practical Scoring Function, defined in the JavaDoc of class Similarity.

What I am trying to do is to calculate the "semantic document similarity". One example similarity function for that purpose is given on page two of the paper "Corpus-based and Knowledge-based Measures of Text Semantic Similarity" by Rada Mihalcea (formula 1). Instead of using the TF and IDF values, it uses IDF values and the relatedness between the unique words of the documents to compare. First, it sums up the relatedness of each unique word in document 1 to its most related word in document 2, multiplied by its IDF value. The same procedure is done for document 2. After that, the sums are averaged.
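To make the measure concrete, here is a minimal sketch of it in plain Java, outside of Lucene. All names (SemanticSimilarity, directed, similarity, the idf map and the relatedness callback) are hypothetical and only for illustration. The sketch also assumes that each directed sum is divided by the sum of the IDF values of that document's words, which is how I read formula 1; the word-word relatedness itself is assumed to be pre-calculated and supplied by the caller.

```java
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper, not part of Lucene: IDF-weighted semantic similarity
// between two documents given as sets of unique words.
final class SemanticSimilarity {

    // One direction: for every word w in `from`, take the relatedness to its
    // most related word in `to`, weight it by idf(w), and sum up.
    // Assumption: the sum is normalized by the total IDF of `from`.
    static double directed(Set<String> from, Set<String> to,
                           Map<String, Double> idf,
                           ToDoubleBiFunction<String, String> relatedness) {
        double weightedSum = 0.0;
        double idfSum = 0.0;
        for (String w : from) {
            double best = 0.0;
            for (String v : to) {
                best = Math.max(best, relatedness.applyAsDouble(w, v));
            }
            double weight = idf.getOrDefault(w, 0.0); // unknown words get weight 0
            weightedSum += best * weight;
            idfSum += weight;
        }
        return idfSum == 0.0 ? 0.0 : weightedSum / idfSum;
    }

    // Average of both directions (document 1 -> 2 and document 2 -> 1).
    static double similarity(Set<String> doc1, Set<String> doc2,
                             Map<String, Double> idf,
                             ToDoubleBiFunction<String, String> relatedness) {
        return 0.5 * (directed(doc1, doc2, idf, relatedness)
                    + directed(doc2, doc1, idf, relatedness));
    }
}
```

The nested loop compares every unique word of one document with every unique word of the other, so the cost per document pair grows with the product of the two vocabulary sizes. That is the computational effort I am worried about when doing this at query time.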

My question is: Given that I am able to store WordNet words extracted from the documents in the index and pre-calculate the word-word similarities, is it possible / does it make sense (e.g. from the point of view of computational effort) to adapt the Practical Scoring Function to such a function of semantic document similarity? And where (in which class) is the Practical Scoring Function implemented, i.e. where are the values of TF, IDF, Boost... put together?

Regards,
Mathias Silbermann

