2011/11/16 SK Sn <[email protected]>:
> Hi there,
>
> I experienced abnormal behaviors of RidgeClassifier in context of text
> classification.
>
> Test setup: ~800 documents, ~2500 features, 15 classes, scikit-learn dev
> version (version few days ago), classification with KFold.
> Problem:
> When RidgeClassifier is tested, different results (f1,precision,recall) are
> generated when X is in different formats, i.e. scipy.sparse
> vs ( numpy.ndarray(by toarray) or numpy.matrixlib.defmatrix.matrix(by
> todense) ).

You should never use numpy.matrix objects: use either scipy.sparse
matrices or numpy arrays. For text data, you should probably stick to
estimators that work on scipy.sparse input.

> The difference of results (f1/precision/recall) between X sparse and
> (X.todense() or X.array()) are about -0.5% to +1.0%.

Always use X.toarray() if you really need to materialize a dense
representation of a sparse dataset. X.todense() is a trap: it returns
a numpy.matrix rather than a numpy array.
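
A minimal sketch of why the two conversions are not interchangeable,
assuming a small hand-built scipy.sparse.csr_matrix (the values and
variable names are illustrative only, not taken from the reported
setup):

```python
import numpy as np
import scipy.sparse as sp

# Tiny illustrative CSR matrix (2 samples, 3 features).
X = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                            [0.0, 3.0, 0.0]]))

A = X.toarray()   # plain numpy.ndarray
M = X.todense()   # numpy.matrix, which has different semantics

# Indexing a row: the ndarray drops a dimension, the matrix stays 2-D.
row_a = A[0]      # shape (3,)
row_m = M[0]      # shape (1, 3)

# The `*` operator: elementwise for ndarray, matrix product for matrix.
elementwise = A * A      # shape (2, 3)
matprod = M * M.T        # shape (2, 2)

print(type(A).__name__, row_a.shape, elementwise.shape)
print(type(M).__name__, row_m.shape, matprod.shape)
```

Code written for arrays can therefore behave differently when handed
the result of todense(), for example through the `*` operator or row
indexing.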

> Tests:
> Tested in full feature scenario, feature selection scenario, and parts of
> classes scenario, this difference all occurs.
> Other classifiers that can operate on scipy.sparse are tested, none of them
> have this problem. Namely, kNN, Naive Bayes, LinearSVC, SGDClassifier.

Can you please provide a minimal reproduction script that highlights
the issue as a gist (see http://gist.github.com )? The 20 newsgroups
dataset would be a good candidate, for instance.
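
Such a script could look roughly like the sketch below. It is only an
outline under several assumptions: it uses random synthetic data as a
stand-in for 20 newsgroups so that it runs offline, the sizes and the
helper name cv_f1 are hypothetical, and it relies on the
sklearn.model_selection API of recent scikit-learn releases rather
than the dev version mentioned above. The labels are random, so the
scores themselves are meaningless; only the sparse versus dense
comparison matters.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
n_samples, n_features, n_classes = 200, 50, 5  # hypothetical sizes

# Synthetic stand-in for a sparse document-term matrix.
X_sparse = sp.random(n_samples, n_features, density=0.1,
                     format="csr", random_state=rng)
y = rng.randint(n_classes, size=n_samples)
X_dense = X_sparse.toarray()  # same data as a numpy.ndarray


def cv_f1(X, y):
    """Return per-fold macro F1 of RidgeClassifier on fixed KFold splits."""
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = []
    for train, test in cv.split(X):
        clf = RidgeClassifier(alpha=1.0).fit(X[train], y[train])
        pred = clf.predict(X[test])
        scores.append(f1_score(y[test], pred, average="macro"))
    return np.array(scores)


f1_sparse = cv_f1(X_sparse, y)
f1_dense = cv_f1(X_dense, y)
print("sparse:", f1_sparse)
print("dense: ", f1_dense)
print("max abs difference:", np.abs(f1_sparse - f1_dense).max())
```

Fixing the random seeds and reusing the same folds for both inputs
keeps the comparison limited to the input format.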

As for decision trees, I think it's normal that a single tree gives
bad results. The future Random Forest implementation should improve
upon that, but I don't think the current code base supports sparse
data as input: only dense data is supported.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
