Aaron Schon wrote:
...I was wondering if taking a bag of words approach might work. For example, chunking the sentences to be analyzed and running a Lucene query against an index storing sentiment polarity. Has anyone had success with this approach? I do not need a super accurate system, just something that is "reasonably" accurate.
Even the best sentiment analyzers aren't that good. And they need to be trained per domain (e.g. "easy to use" is good for electronics and "leaky" is bad, but your mileage varies in other domains, where "fuel efficient" or "entertaining" might be good). You'll take a hit in performance using a bag of words (or stemmed, stoplisted, case-normalized terms) because you lose subword generalizations if the stemmer's not great or if word segmentation varies, and you lose cross-word discriminative power going a word at a time. Using TF/IDF to weight the terms can help.
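
To make the TF/IDF-weighted bag-of-words idea concrete, here's a rough sketch in plain Java (no Lucene). The polarity lexicon, document frequencies and corpus size below are made-up placeholders; in practice you'd learn the polarity weights per domain and pull document frequencies from your own index.

import java.util.*;

public class BowSentiment {

    // Hypothetical per-term polarity weights (+1 positive, -1 negative);
    // these would be learned per domain, not hard-coded.
    static final Map<String, Double> POLARITY = Map.of(
        "easy", 1.0, "efficient", 1.0, "entertaining", 1.0,
        "leaky", -1.0, "broken", -1.0, "boring", -1.0);

    // Hypothetical document frequencies from some background corpus.
    static final Map<String, Integer> DOC_FREQ = Map.of(
        "easy", 500, "efficient", 200, "entertaining", 300,
        "leaky", 50, "broken", 150, "boring", 250);
    static final int NUM_DOCS = 10_000;

    // Score one sentence: sum over its terms of tf * idf * polarity.
    static double score(String sentence) {
        Map<String, Integer> tf = new HashMap<>();
        for (String tok : sentence.toLowerCase().split("\\W+"))
            tf.merge(tok, 1, Integer::sum);
        double s = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Double pol = POLARITY.get(e.getKey());
            if (pol == null) continue;              // term not in the lexicon
            double idf = Math.log((double) NUM_DOCS
                                  / DOC_FREQ.getOrDefault(e.getKey(), 1));
            s += e.getValue() * idf * pol;
        }
        return s;   // > 0 leans positive, < 0 leans negative
    }

    public static void main(String[] args) {
        System.out.println(score("Easy to use and surprisingly fuel efficient"));
        System.out.println(score("The seal is leaky and the remote is broken"));
    }
}

If you're already running Lucene, its analyzers and term statistics can stand in for the hand-rolled tokenization and DOC_FREQ map here; the scoring logic stays the same.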
Also, could you suggest good publicly available training datasets? I am aware of the Cornell Movie Reviews dataset [1].
The Pang and Lee data from Cornell was collected automatically from Rotten Tomatoes and IMDB. Gathering more data like that from Amazon, C-net, etc. should be easy; that's what everyone's doing for evaluations. But these are all at the review level, not at the sentence level. We've actually had customers annotating at the sentence level, which can produce much tighter training sets.

- Bob Carpenter
  Alias-i
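
P.S. If you do scrape reviews from Amazon and the like, the labeling step is basically thresholding the star ratings. A minimal sketch (the 4-star/2-star cutoffs and the Review fields are my own assumptions, not anything standard):

import java.util.Optional;

public class ReviewLabels {

    // Hypothetical container for a scraped review; you'd fill this in
    // from whatever you parse off the site.
    record Review(String text, int stars) {}

    // Map star ratings to review-level polarity labels, roughly the way
    // the automatically collected corpora do.
    static Optional<String> label(Review r) {
        if (r.stars() >= 4) return Optional.of("positive");
        if (r.stars() <= 2) return Optional.of("negative");
        return Optional.empty();   // drop middling reviews as ambiguous
    }

    public static void main(String[] args) {
        Review r = new Review("Easy to use, great battery life.", 5);
        label(r).ifPresent(System.out::println);   // prints "positive"
    }
}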