On 26 March 2012 at 14:14, Jaques Grobler <[email protected]> wrote:
> Hi everyone,
>
> I stumbled upon this post that offers a quick run-through of
> text feature extraction using
> sklearn.feature_extraction.text's CountVectorizer:
>
> http://pyevolve.sourceforge.net/wordpress/?p=1589&cpage=1#comment-15857
>
> Upon copying the code into ipython, I get different outputs from his. It
> appears as though there have been changes to this module since he made
> this post, but I don't see anything in the change-log, unless I'm
> missing it.

The module has been completely refactored in master, as stated in the
changelog:

http://scikit-learn.org/dev/whats_new.html

> Just want to give the guy a heads-up about it. Can anyone point me in a
> direction or help here?

In particular, the IDF smoothing used to produce negative values for
highly frequent words (very rare in practice). The fix that I used makes
all IDF values larger than 1.0. This might not be as canonical as it
should be, but I tried other alternatives and they tended to decrease the
quality of the KMeans results in the text clustering example.

Also, the fitted vocabulary has been renamed to vocabulary_ to respect the
fit semantics of the rest of the project.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
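
To make the IDF point above concrete, here is a minimal sketch in plain Python. Both formulas are assumptions chosen for illustration and are not quoted from this thread: the first is a common smoothed IDF, log(n / (1 + df)), which goes negative once a word appears in every document; the second, log((1 + n) / (1 + df)) + 1, is one variant consistent with the description of the fix, and never drops below 1.0. The function names are hypothetical.

```python
import math


def idf_naive_smoothed(n_docs, doc_freq):
    """Assumed illustrative formula: log(n / (1 + df)).

    Goes negative when a word occurs in every document (df == n).
    """
    return math.log(n_docs / (1.0 + doc_freq))


def idf_shifted(n_docs, doc_freq):
    """Assumed variant: log((1 + n) / (1 + df)) + 1.

    The log term is >= 0 whenever df <= n, so the result is >= 1.0.
    """
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0


if __name__ == "__main__":
    n_docs = 4
    # Compare both formulas for a rare, a common and a ubiquitous word.
    for doc_freq in (1, 2, 4):
        print(doc_freq,
              idf_naive_smoothed(n_docs, doc_freq),
              idf_shifted(n_docs, doc_freq))
```

With 4 documents, a word present in all of them gets a negative weight under the first formula and a weight of exactly 1.0 under the second, so frequent words are down-weighted but never flipped in sign.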
