You just want the document frequency (df), right? That is what you get by summing binary CountVectorizer counts over all documents.
Picking the low-df words will likely give you a lot of garbage (typos and odd spellings) unless your text is very clean, your tokenizer is very good,
or you have run it through a spell checker first.

On 05/14/2015 04:03 PM, Adam Goodkind wrote:
Hi,

This question might be too general, or opinion-based, but I was looking for advice on how to extract rare words from a very large corpus. The rare words wouldn't necessarily be consistent from document to document, so traditional tf-idf wouldn't be quite right.

Specifically, I'm looking at restaurant reviews, and want to highlight reviews that use very specific language, e.g. "The steak tartare tasted like cotton candy" vs "The food was good". The fact that the sentiment was negative or positive is not as important.

Using a binary tf-idf seems to help, as it doesn't overweight the fact that a word was used multiple times in a single document. Is there any other advice as to how this can be detected?

Thanks,
Adam

--
*Adam Goodkind *
adamgoodkind.com <http://www.adamgoodkind.com>
@adamgreatkind <https://twitter.com/#%21/adamgreatkind>


_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
