Hi there,
(This may be a duplicate: the same message I sent two hours ago seems to have
been lost by the mailing server, so I am resending it.)
I have run into some abnormal behavior of RidgeClassifier in the context of
text classification.
*Test setup*: ~800 documents, ~2500 features, 15 classes, scikit-learn dev
version (a checkout from a few days ago), classification with KFold
cross-validation.
*Problem*:
When RidgeClassifier is evaluated, different results (f1, precision, recall)
are produced depending on the format of X, i.e. scipy.sparse vs.
numpy.ndarray (via toarray) or numpy.matrixlib.defmatrix.matrix (via
todense). The difference in f1/precision/recall between the sparse X and
X.toarray()/X.todense() is about -0.5% to +1.0%.
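Here is a minimal sketch of the kind of comparison I am running (with a
synthetic stand-in for my data, so the numbers and variable names are only
illustrative):

import numpy as np
import scipy.sparse as sp
from sklearn.model_selection import KFold
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in for my data: ~800 docs, ~2500 features, 15 classes.
# The values are random, so the absolute scores are meaningless; this only
# illustrates the sparse-vs-dense comparison.
rng = np.random.RandomState(0)
X_sparse = sp.csr_matrix(sp.random(800, 2500, density=0.01, random_state=rng))
y = rng.randint(0, 15, size=800)

def cv_f1(X, y, n_splits=5):
    scores = []
    for train, test in KFold(n_splits=n_splits).split(X):
        clf = RidgeClassifier().fit(X[train], y[train])
        scores.append(f1_score(y[test], clf.predict(X[test]), average="macro"))
    return np.mean(scores)

print("sparse:", cv_f1(X_sparse, y))
print("dense :", cv_f1(X_sparse.toarray(), y))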
*Tests*:
I tested the full feature set, a feature-selection setup, and subsets of the
classes; the difference shows up in all of them.
I also tested the other classifiers that can operate on scipy.sparse input,
namely kNN, Naive Bayes, LinearSVC, and SGDClassifier; none of them shows
this problem.
So I suspect this may be a bug in Ridge itself. Does anyone know which
result, the sparse one or the toarray/todense one, is the correct one that I
should report?
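If it helps, one way to see the discrepancy directly would be to fit on both
formats of the same data and compare the learned models, e.g. (continuing
from the snippet above):

# If Ridge treated the two input formats identically, the learned
# coefficients and predictions should match (up to tiny numerical noise).
clf_sp = RidgeClassifier().fit(X_sparse, y)
clf_de = RidgeClassifier().fit(X_sparse.toarray(), y)
print("max |coef diff|  :", np.abs(clf_sp.coef_ - clf_de.coef_).max())
print("same predictions :",
      np.array_equal(clf_sp.predict(X_sparse),
                     clf_de.predict(X_sparse.toarray())))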
Another question, about how to use the tree classifier: in the experimental
setting described above I get f1 scores of around 83%-90% with the
classifiers mentioned above after parameter tuning. However, with the tree
classifier my results are always below 65%. I have tried tuning various
parameters but never got a substantial improvement, and I have looked into a
few textbooks and papers, but I still cannot figure out what I should do in
practice to get results from the tree classifier that are comparable to the
other classifiers.
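For reference, my tuning attempts look roughly like this (the grid values
are illustrative, not my exact settings; X_sparse and y are as in the first
snippet):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small illustrative grid over the usual tree regularization parameters.
param_grid = {
    "max_depth": [10, 20, 50, None],
    "min_samples_split": [2, 5, 10],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring="f1_macro", cv=5)
search.fit(X_sparse.toarray(), y)  # fit on the dense array
print(search.best_params_, search.best_score_)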
Could you please shed some light on using trees with high-dimensional data,
or point me to a practical guide on tree classifiers? Any help would be
appreciated!