Re: [Scikit-learn-general] tf-idf changes

Olivier Grisel Mon, 26 Mar 2012 13:14:29 -0700

On 26 March 2012 14:14, Jaques Grobler <[email protected]> wrote:
> Hi everyone-
>
> I stumbled upon this post that offers a quick run-through of
> text feature extraction using
> sklearn.feature_extraction.text's CountVectorizer:
>
>
> http://pyevolve.sourceforge.net/wordpress/?p=1589&cpage=1#comment-15857
>
> Upon copying the code into IPython, I get different outputs from his. It
> appears as though there have been
> changes to this module since he made this post, but I don't see anything in
> the changelog, unless I'm missing it.


The module has been completely refactored in master, as stated in the changelog:

http://scikit-learn.org/dev/whats_new.html

> Just want to give the guy a heads-up about it. Can anyone point me in a
> direction or help here?

In particular, the IDF smoothing used to produce negative values for
very frequent words (a case that is very rare in practice). The fix
that I used makes all IDF values larger than 1.0. This might not be as
canonical as it should be, but I tried other alternatives and they
tended to decrease the quality of the KMeans results in the text
clustering example...
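
To illustrate the point above, here is a minimal sketch in plain Python.
The two formulas are assumptions chosen for illustration (a common
variant that smooths only the denominator, and a variant that adds 1.0
to a smoothed log ratio); they are not necessarily the exact expressions
used in the code base before or after the refactoring. The function
names are hypothetical.

```python
import math


def idf_smoothed_denominator(n_docs, doc_freq):
    # Assumed "old style" variant: only the denominator is smoothed, so a
    # term present in every document gets log(n / (n + 1)) < 0.
    return math.log(n_docs / (1 + doc_freq))


def idf_plus_one(n_docs, doc_freq):
    # Assumed "fixed" variant: smooth numerator and denominator, then add
    # 1.0 so the weight never drops below 1.0.
    return 1.0 + math.log((1 + n_docs) / (1 + doc_freq))


n_docs = 10
for doc_freq in (1, 5, 10):
    print(doc_freq,
          idf_smoothed_denominator(n_docs, doc_freq),
          idf_plus_one(n_docs, doc_freq))
```

With these assumed formulas, a term that occurs in all 10 documents gets
a negative weight from the first function, while the second one stays at
or above 1.0 for every possible document frequency.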

Also, the fitted vocabulary attribute has been renamed to vocabulary_ to
follow the fit semantics of the rest of the project.
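
As a rough illustration of that convention (attributes learned from data
are set in fit and carry a trailing underscore), here is a toy sketch.
TinyCountVectorizer is a hypothetical class written for this example
only; it is not the project's CountVectorizer and its tokenization is
deliberately naive.

```python
class TinyCountVectorizer:
    """Hypothetical toy class showing the trailing-underscore convention."""

    def __init__(self, lowercase=True):
        # Constructor parameters are stored as-is, without an underscore.
        self.lowercase = lowercase

    def fit(self, documents):
        tokens = set()
        for doc in documents:
            text = doc.lower() if self.lowercase else doc
            tokens.update(text.split())  # naive whitespace tokenization
        # State learned from the data gets a trailing underscore and only
        # exists after fit has been called.
        self.vocabulary_ = {tok: i for i, tok in enumerate(sorted(tokens))}
        return self


vec = TinyCountVectorizer().fit(["the sun is bright", "the sky is blue"])
print(vec.vocabulary_)
```

Code that reads the fitted vocabulary therefore has to call fit first and
then look up vocabulary_ rather than the old attribute name.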

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
