2011/11/17 SK Sn <[email protected]>:
> @Olivier, the quick reproduction of the error using 20Newsgroups -
> https://gist.github.com/1372557
> Also, does it mean, actually, for text classification problems, trees are
> used less often?

Probably yes: simple linear models are often much faster to train and more
scalable, and most text classification problems are approximately linearly
separable (e.g. non-linear models such as Gaussian kernels risk over-fitting
and take much longer to train). It would be interesting to try the new
Random Forest once it's merged, though.

> @Mathieu, is this the case only for Ridge? kNN, NB, linearSVC do not have
> such a behavior.
> If for Ridge, different solvers are used, which result should I refer to as
> result from Ridge?

Ok, so if I understand correctly, the real issue is:

# with .toarray(), results: f1:0.99634, precision 0.99637
# only X (sparse), results: f1:0.99524, precision 0.99526
# All other classifiers (kNN, NB, etc) have consistent results no matter toarray() or not.

I wonder if this is not just about rounding errors. Still, an f1 score > 0.995
is excellent. I would not call that a bug :P

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
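
On the rounding-error hypothesis above, here is a minimal sketch in plain
Python (not scikit-learn code) of how two solvers can return slightly
different ridge coefficients for the same data. It assumes, as the quoted
question suggests, that the dense and sparse code paths use different
solvers; the helper names (ridge_system, solve_direct, solve_cg), the toy
data and the tolerance are all made up for illustration.

```python
import random


def ridge_system(X, y, alpha):
    """Build A = X^T X + alpha * I and b = X^T y for ridge regression."""
    n_features = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
          for j in range(n_features)] for i in range(n_features)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(n_features)]
    return A, b


def solve_direct(A, b):
    """Gaussian elimination with partial pivoting (stand-in for a dense solver)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        w[i] = (M[i][n] - sum(M[i][j] * w[j] for j in range(i + 1, n))) / M[i][i]
    return w


def solve_cg(A, b, tol=1e-6, max_iter=50):
    """Conjugate gradient, stopped once the residual norm drops below tol."""
    n = len(b)
    w = [0.0] * n
    r = b[:]          # residual b - A w, with w = 0 initially
    p = r[:]          # search direction
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        step = rs / sum(pi * api for pi, api in zip(p, Ap))
        w = [wi + step * pi for wi, pi in zip(w, p)]
        r = [ri - step * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return w


# Toy regression problem (hypothetical data, fixed seed for repeatability).
random.seed(0)
X = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(30)]
true_w = [1.0, -2.0, 0.5, 3.0]
y = [sum(xi * wi for xi, wi in zip(row, true_w)) + random.gauss(0, 0.1)
     for row in X]

A, b = ridge_system(X, y, alpha=1.0)
w_direct = solve_direct(A, b)
w_cg = solve_cg(A, b)
max_diff = max(abs(a - c) for a, c in zip(w_direct, w_cg))

print("direct:", w_direct)
print("cg:    ", w_cg)
print("max abs difference:", max_diff)
```

The two coefficient vectors should agree only up to the iterative solver's
stopping tolerance. A difference that small could still move a few borderline
predictions across the decision threshold, which would be enough to shift f1
in the third decimal place.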
