Thanks! That makes a lot of sense. I hadn't thought to use binary counts
with a CountVectorizer.
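
For anyone finding this thread later, here is a minimal sketch of how I read
the suggestion, assuming scikit-learn and NumPy are installed. The helper names
(document_frequencies, rare_term_counts) and the max_df_count threshold are my
own, purely for illustration; only CountVectorizer(binary=True) comes from the
advice quoted below.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def document_frequencies(docs):
    """Map each term to the number of documents it appears in (its df)."""
    # binary=True records presence (1) or absence (0), so repeating a word
    # inside one review does not inflate its count.
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(docs)
    df = np.asarray(X.sum(axis=0)).ravel()
    # vocabulary_ maps each term to its column index in X.
    return {term: int(df[col]) for term, col in vectorizer.vocabulary_.items()}


def rare_term_counts(docs, max_df_count=1):
    """For each document, count its distinct terms with df <= max_df_count."""
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(docs)
    df = np.asarray(X.sum(axis=0)).ravel()
    rare_cols = np.flatnonzero(df <= max_df_count)
    return np.asarray(X[:, rare_cols].sum(axis=1)).ravel().tolist()


if __name__ == "__main__":
    reviews = [
        "The steak tartare tasted like cotton candy",
        "The food was good",
        "The food was good and the service was good",
    ]
    print(document_frequencies(reviews)["good"])  # -> 2
    print(rare_term_counts(reviews))              # -> [6, 0, 2]
```

On a real corpus, terms that appear in only one document will include the
typos and odd spellings mentioned below, so it may be worth setting
CountVectorizer's min_df parameter to drop them before ranking reviews.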

On Thu, May 14, 2015 at 4:15 PM, Andreas Mueller <t3k...@gmail.com> wrote:

>  You just want df (document frequency), right? That is what binary
> CountVectorizer counts give you.
> This will likely give you a lot of garbage [typos and odd spellings]
> unless your text is very clean, your tokenizer is very good,
> or you ran it through a spell checker first.
>
>
> On 05/14/2015 04:03 PM, Adam Goodkind wrote:
>
> Hi,
>
>  This question might be too general, or opinion-based, but I was looking
> for advice on how to extract rare words from a very large corpus. The rare
> words wouldn't necessarily be consistent from document to document, so
> traditional tf-idf wouldn't be quite right.
>
>  Specifically, I'm looking at restaurant reviews, and want to highlight
> reviews that use very specific language, e.g. "The steak tartare tasted
> like cotton candy" vs "The food was good". The fact that the sentiment was
> negative or positive is not as important.
>
>  Using a binary tf-idf seems to help, as it doesn't overweight the fact
> that a word was used multiple times in a single document. Is there any
> other advice as to how this can be detected?
>
>  Thanks,
> Adam
>
>  --
>  *Adam Goodkind *
> adamgoodkind.com <http://www.adamgoodkind.com>
> @adamgreatkind <https://twitter.com/#%21/adamgreatkind>
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


-- 
*Adam Goodkind *
adamgoodkind.com <http://www.adamgoodkind.com>
@adamgreatkind <https://twitter.com/#!/adamgreatkind>
