Greetings all,

I will be presenting a poster at NAACL 2010 in Los Angeles about some recent experiments I've done with WordNet::Similarity. The question the poster asks is whether we really need sense-tagged text in order to get reliable results from the Information Content based measures (these include the Lin (lin), Resnik (res) and Jiang & Conrath (jcn) measures).
The answer I report in this paper is "no". In fact, you seem to get somewhat better results just by using a moderate amount of raw text.

Information Content Measures of Semantic Similarity Perform Better Without Sense-Tagged Text (Pedersen) - To Appear in the Proceedings of the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), June 1-6, 2010, Los Angeles, CA
http://www.d.umn.edu/~tpederse/Pubs/pedersen-naacl-2010.pdf

The main reason I did these experiments and wrote this paper is that I've noticed that many WordNet::Similarity users who use the Information Content measures tend to rely on the default information content file we provide (which is based on SemCor, a sense-tagged corpus). I hope this paper will persuade such users to at least consider getting their Information Content values from raw/plain text, which is very easily done with our utility program rawtextFreq.pl:
http://search.cpan.org/~tpederse/WordNet-Similarity/utils/rawtextFreq.pl

If you'll be at NAACL, please do plan on stopping by during the poster session. Details on the schedule are available here: http://naaclhlt2010.isi.edu/

If you won't be there but are interested in the issue, please check out the paper and let me know if you have any questions, or aren't sure how to proceed in getting your own info content files. I'd be happy to help.

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse