You just want the document frequency (df), right? That is what you get
by summing binary CountVectorizer counts over the documents.
This will likely give you a lot of garbage (typos and odd spellings)
unless your text is very clean, your tokenizer is very good,
or you run it through a spell checker first.
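
A minimal sketch of that idea, assuming scikit-learn and NumPy are installed. The toy reviews, the max_df threshold of 1 and the "count distinct rare terms per review" score are made up for illustration and are not from the original question:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews, invented for illustration only.
reviews = [
    "The food was good",
    "The food was good and the service was good",
    "The steak tartare tasted like cotton candy",
    "Good food, good service",
]

# binary=True records presence/absence, so a word repeated inside one
# review is counted only once for that review.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(reviews)  # sparse matrix, shape (n_docs, n_terms)

# Column sums of the binary matrix are the document frequencies.
df = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ maps term -> column index; sort terms by column index.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
doc_freq = dict(zip(terms, df.tolist()))

# Hypothetical threshold: a term is "rare" if it occurs in at most max_df reviews.
max_df = 1
rare_terms = {term for term, n in doc_freq.items() if n <= max_df}

# Illustrative score: number of distinct rare terms in each review.
rare_cols = [vectorizer.vocabulary_[term] for term in rare_terms]
rare_per_review = np.asarray(X[:, rare_cols].sum(axis=1)).ravel()

print(sorted(rare_terms))
print(rare_per_review.tolist())
```

On a real corpus the rare set will also pick up typos and tokenizer artifacts, so the threshold probably needs tuning, and it may be worth setting a lower bound as well (for example via CountVectorizer's min_df parameter) to drop words that occur only once.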
On 05/14/2015 04:03 PM, Adam Goodkind wrote:
Hi,
This question might be too general, or opinion-based, but I was
looking for advice on how to extract rare words from a very large
corpus. The rare words wouldn't necessarily be consistent from
document to document, so traditional tf-idf wouldn't be quite right.
Specifically, I'm looking at restaurant reviews, and want to highlight
reviews that use very specific language, e.g. "The steak tartare
tasted like cotton candy" vs "The food was good". The fact that the
sentiment was negative or positive is not as important.
Using a binary tf-idf seems to help, as it doesn't overweight the fact
that a word was used multiple times in a single document. Is there any
other advice as to how this can be detected?
Thanks,
Adam
--
*Adam Goodkind *
adamgoodkind.com <http://www.adamgoodkind.com>
@adamgreatkind <https://twitter.com/#%21/adamgreatkind>
------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general