Greetings all,

I will be presenting a poster at NAACL 2010 in Los Angeles about some
recent experiments I've done with WordNet::Similarity. The question
asked by the poster is whether we really need sense tagged text in
order to get reliable results from the Information Content based
measures (these include the Lin (lin), Resnik (res) and Jiang &
Conrath (jcn) measures).

The answer I report in this paper is "no". In fact you seem to get
somewhat better results just by using a moderate amount of raw text.

Information Content Measures of Semantic Similarity Perform Better
Without Sense-Tagged Text  (Pedersen) - To Appear in the Proceedings
of the 11th Annual Conference of the North American Chapter of the
Association for Computational Linguistics (NAACL HLT 2010), June 1-6,
2010, Los Angeles, CA
http://www.d.umn.edu/~tpederse/Pubs/pedersen-naacl-2010.pdf

The main reason I did these experiments and wrote this paper was that
I've noticed that many WordNet::Similarity users who use the
Information Content measures tend to rely on the default information
content file we provide (which is based on SemCor, a sense tagged
corpus). I hope this paper persuades such users to at least
consider getting their Information Content values from raw/plain text,
which is very easily done with our utility program rawtextFreq.pl
(http://search.cpan.org/~tpederse/WordNet-Similarity/utils/rawtextFreq.pl).
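
For anyone who wants a feel for what "Information Content from raw text"
means, here is a rough Python sketch. It is only an illustration under
stated assumptions, not the WordNet::Similarity implementation and not
what rawtextFreq.pl literally does: the tiny hierarchy, the sense
inventory and all function names are made up for the example. Because
raw text carries no sense tags, the sketch simply credits every possible
sense of a word, passes the counts up to the ancestors, and then applies
the usual res, lin and jcn formulas.

```python
import math
from collections import Counter

# Hypothetical toy hierarchy (child -> parent); NOT real WordNet data.
PARENT = {
    "dog": "animal",
    "cat": "animal",
    "animal": "entity",
    "bank_river": "entity",
    "bank_money": "entity",
}
ROOT = "entity"

# Hypothetical sense inventory: raw text does not say which sense is meant.
SENSES = {
    "dog": ["dog"],
    "cat": ["cat"],
    "bank": ["bank_river", "bank_money"],
}


def ancestors(concept):
    """Return the concept followed by its ancestors up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def concept_counts(tokens):
    """Credit every sense of each known word and propagate counts upward."""
    counts = Counter()
    for tok in tokens:
        for sense in SENSES.get(tok, []):
            for concept in ancestors(sense):
                counts[concept] += 1
    return counts


def information_content(counts):
    """IC(c) = -log(count(c) / count(root)) for every observed concept."""
    total = counts[ROOT]
    return {c: -math.log(n / total) for c, n in counts.items()}


def lcs(c1, c2):
    """Lowest common subsumer of two concepts in the toy tree."""
    seen = set(ancestors(c2))
    for concept in ancestors(c1):
        if concept in seen:
            return concept
    raise ValueError("concepts share no ancestor")


def res(ic, c1, c2):
    """Resnik: IC of the lowest common subsumer."""
    return ic[lcs(c1, c2)]


def lin(ic, c1, c2):
    """Lin: 2 * IC(lcs) / (IC(c1) + IC(c2)); 0.0 if the denominator is 0."""
    denom = ic[c1] + ic[c2]
    return 2 * ic[lcs(c1, c2)] / denom if denom > 0 else 0.0


def jcn_distance(ic, c1, c2):
    """Jiang & Conrath distance: IC(c1) + IC(c2) - 2 * IC(lcs)."""
    return ic[c1] + ic[c2] - 2 * ic[lcs(c1, c2)]


# A "corpus" of four untagged tokens.
tokens = "dog cat dog bank".split()
ic = information_content(concept_counts(tokens))
print("res:", res(ic, "dog", "cat"))
print("lin:", lin(ic, "dog", "cat"))
print("jcn distance:", jcn_distance(ic, "dog", "cat"))
```

The jcn value above is a distance; a similarity score is commonly taken
as its inverse. With real data the counts would come from a large plain
text corpus rather than four tokens.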

If you'll be at NAACL, please plan on stopping by during the poster
session. Details on the schedule are available here:
http://naaclhlt2010.isi.edu/

If you won't be there but are interested in the issue, please check out
the paper and let me know if you have any questions or aren't sure how
to go about creating your own information content files. I'd be happy
to help.

Cordially,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
