Hi,

This question might be too general or opinion-based, but I'm looking for
advice on how to extract rare words from a very large corpus. The rare
words wouldn't necessarily be consistent from document to document, so
traditional tf-idf wouldn't be quite right.

Specifically, I'm looking at restaurant reviews, and I want to highlight
reviews that use very specific language, e.g. "The steak tartare tasted
like cotton candy" vs. "The food was good". Whether the sentiment is
negative or positive is less important.

Using binary tf-idf seems to help, since it doesn't overweight a word just
because it was used multiple times in a single document. Is there any other
advice on how to detect this kind of specific language?
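
To make the idea concrete, here is a rough sketch of what I mean by binary
tf-idf, written in plain Python so it runs without any dependencies. Each
word counts at most once per review, and a review is scored by the mean idf
of its distinct words. The function name `specificity_scores`, the simple
regex tokenizer, the choice of the mean as the aggregate, and the toy reviews
are all my own assumptions for illustration. The idf formula is the smoothed
variant, ln((1 + n) / (1 + df)) + 1. In scikit-learn, `TfidfVectorizer` has a
`binary=True` option that sets the tf term to 0/1 in a similar spirit.

```python
import math
import re
from collections import Counter


def specificity_scores(docs):
    """Score each document by the mean smoothed idf of its distinct words.

    Hypothetical helper for illustration; not part of any library.
    """
    # Binary term presence: a set keeps each word once per document,
    # so repeating a word inside one review adds no weight.
    token_sets = [set(re.findall(r"[a-z']+", d.lower())) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(w for toks in token_sets for w in toks)

    # Smoothed idf: rare words get a higher weight than common ones.
    idf = {
        w: math.log((1 + n_docs) / (1 + df)) + 1
        for w, df in doc_freq.items()
    }

    # Mean idf over distinct words (an assumed aggregate; max or sum
    # would be other reasonable choices). Empty documents score 0.0.
    return [
        sum(idf[w] for w in toks) / len(toks) if toks else 0.0
        for toks in token_sets
    ]


# Toy reviews, invented for this example.
reviews = [
    "The steak tartare tasted like cotton candy",
    "The food was good",
    "The food was good and the service was good",
    "Good food",
]
scores = specificity_scores(reviews)

# Print reviews from most to least "specific" under this score.
for score, text in sorted(zip(scores, reviews), reverse=True):
    print(f"{score:.3f}  {text}")
```

On this toy set the steak tartare review ranks first, but on a real corpus
a score like this would probably also reward typos and very short reviews,
so a minimum document frequency or length normalization may be needed.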

Thanks,
Adam

-- 
*Adam Goodkind*
adamgoodkind.com <http://www.adamgoodkind.com>
@adamgreatkind <https://twitter.com/#!/adamgreatkind>
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general