Aaron Schon wrote:
...I was wondering if taking a bag of words approach might work. For example, 
chunking the sentences to be analyzed and running a Lucene query against an 
index storing sentiment polarity. Has anyone had success with this approach? 
I do not need a super accurate system, just something that is "reasonably" 
accurate.

Even the best sentiment analyzers aren't that good.

And they need to be trained per domain (e.g. "easy to
use" is good for electronics and "leaky" is bad, but
your mileage will vary in other domains, where "fuel
efficient" or "entertaining" might be the good cues).

You'll take a hit in performance using a bag of
words (or stemmed, stoplisted, case-normalized terms)
because you lose subword generalizations if the stemmer's
not great or if word segmentation varies, and you'll lose
cross-word discriminative power going a word at a time.
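
Something like the setup you describe, minus Lucene, fits in a
few lines of Java.  This is just a sketch: the polarity lexicon,
stoplist, and the crude suffix-stripping "stemmer" below are
made-up placeholders for illustration, nothing more.

import java.util.*;

public class BagOfWordsSentiment {
    // Hypothetical domain lexicon: +1 for positive terms, -1 for negative.
    static final Map<String, Integer> POLARITY = Map.of(
        "easy", 1, "great", 1, "efficient", 1,
        "leaky", -1, "broken", -1, "awful", -1);

    static final Set<String> STOPLIST = Set.of("the", "a", "to", "is", "and");

    // Case-normalize, stoplist, and crudely "stem" by stripping a few suffixes.
    static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.toLowerCase().split("\\W+")) {
            if (tok.isEmpty() || STOPLIST.contains(tok)) continue;
            out.add(tok.replaceAll("(ing|ed|s)$", ""));
        }
        return out;
    }

    // Sum term polarities; the sign of the total is the predicted sentiment.
    static int score(String sentence) {
        int total = 0;
        for (String t : terms(sentence))
            total += POLARITY.getOrDefault(t, 0);
        return total;
    }

    public static void main(String[] args) {
        System.out.println(score("The case is easy to open"));   // 1
        System.out.println(score("Not easy to use and leaky"));  // 0: "not" never reaches "easy"
    }
}

The second example is the cross-word problem in miniature: the
negation is invisible to unigrams, so the positive and negative
terms just cancel.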

Using TF/IDF to weight the terms can help.
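
Again, just a sketch of raw-count TF times IDF over a toy corpus
of token lists; the tokens here are invented for illustration:

import java.util.*;

public class TfIdf {
    // idf(t) = log(N / df(t)), with df counted over a corpus of token lists.
    static Map<String, Double> idf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs)
            for (String t : new HashSet<>(doc))
                df.merge(t, 1, Integer::sum);
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet())
            idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        return idf;
    }

    // TF-IDF weights for one document: raw term count times idf.
    static Map<String, Double> weights(List<String> doc, Map<String, Double> idf) {
        Map<String, Double> w = new HashMap<>();
        for (String t : doc)
            w.merge(t, idf.getOrDefault(t, 0.0), Double::sum);
        return w;
    }

    public static void main(String[] args) {
        // Toy corpus of pre-tokenized "reviews", purely for illustration.
        List<List<String>> docs = List.of(
            List.of("easy", "use", "great"),
            List.of("leaky", "easy", "return"),
            List.of("great", "battery", "great"));
        System.out.println(weights(docs.get(2), idf(docs)));
    }
}

Terms that show up in every review get an IDF near zero, so they
stop drowning out the rarer, more opinion-bearing words.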

Also, could you suggest good publicly available training datasets? I am aware 
of the Cornell Movie Reviews dataset [1].

The Pang and Lee data from Cornell was collected automatically
from Rotten Tomatoes and IMDB.  Gathering more data like that
from Amazon, C-net, etc. should be easy.  That's what everyone's
doing for evaluations.

But these are all at the review level, not at the sentence
level.  We've actually had customers annotating at the sentence
level, which can produce much tighter training sets.

- Bob Carpenter
  Alias-i
