2011/11/16 SK Sn <[email protected]>:
> Hi there,
>
> I experienced abnormal behaviors of RidgeClassifier in context of text
> classification.
>
> Test setup: ~800 documents, ~2500 features, 15 classes, scikit-learn dev
> version (version few days ago), classification with KFold.
> Problem:
> When RidgeClassifier is tested, different results (f1,precision,recall) are
> generated when X is in different formats, i.e. scipy.sparse
> vs ( numpy.ndarray(by toarray) or numpy.matrixlib.defmatrix.matrix(by
> todense) ).

You should never pass numpy.matrix objects to estimators: use either
scipy.sparse matrices or numpy arrays. For text data, you should probably
stick to estimators that work on scipy.sparse input.

> The difference of results (f1/precision/recall) between X sparse and
> (X.todense() or X.array()) are about -0.5% to +1.0%.

Always use X.toarray() if you really need to materialize a dense
representation of a sparse dataset. X.todense() is a trap: it returns a
numpy.matrix rather than a plain numpy array.

> Tests:
> Tested in full feature scenario, feature selection scenario, and parts of
> classes scenario, this difference all occurs.
> Other classifiers that can operate on scipy.sparse are tested, none of them
> have this problem. Namely, kNN, Naive Bayes, LinearSVC, SGDClassifier.

Can you please provide a minimal reproduction script that highlights the
issue as a gist (see http://gist.github.com)? Maybe using the 20 newsgroups
dataset, for instance.

As for decision trees, I think it's normal that a single tree gives bad
results. The future Random Forest implementation should improve upon that,
but I don't think the current code base supports sparse input the way it
supports dense data.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
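
To illustrate the X.toarray() vs X.todense() advice above, here is a minimal sketch using only numpy and scipy. The tiny matrix is made-up data standing in for a document-term matrix, and the variable names are purely illustrative. It does not reproduce the RidgeClassifier discrepancy; it only shows why the two dense conversions are not interchangeable.

```python
import numpy as np
import scipy.sparse as sp

# Small sparse matrix standing in for a document-term matrix (made-up values).
X = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                            [0.0, 3.0, 0.0]]))

A = X.toarray()   # plain numpy.ndarray
M = X.todense()   # numpy.matrix, an ndarray subclass that is always 2-D

print(type(A).__name__)  # ndarray
print(type(M).__name__)  # matrix

# Same values, different semantics:
# `*` is elementwise on an ndarray but matrix multiplication on a matrix.
elementwise = A * A      # shape (2, 3)
matmul = M * M.T         # shape (2, 2)

# Indexing a row drops a dimension on an ndarray but not on a matrix.
row_a = A[0]             # shape (3,)
row_m = M[0]             # shape (1, 3)
```

Code written with ndarray semantics in mind can therefore behave differently when handed a numpy.matrix, which is one reason to prefer X.toarray() whenever a dense copy is really needed.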
