On Thu, Nov 17, 2011 at 1:23 AM, SK Sn <[email protected]> wrote:
> Hi there,
>
> (This may be a duplication. The same message sent two hours ago seems to
> have been lost by the mailing server, so I am resending it.)
>
> I experienced abnormal behavior of RidgeClassifier in the context of text
> classification.
>
> Test setup: ~800 documents, ~2500 features, 15 classes, scikit-learn dev
> version (from a few days ago), classification with KFold.
>
> Problem:
>
> When RidgeClassifier is tested, different results (f1, precision, recall)
> are generated when X is in different formats, i.e. scipy.sparse vs.
> numpy.ndarray (by toarray) or numpy.matrixlib.defmatrix.matrix (by todense).
>
> The difference in results (f1/precision/recall) between the sparse X and
> X.todense() or X.toarray() is about -0.5% to +1.0%.
Hi,

The algorithms used by the dense version and the sparse one are indeed
different, so it is not uncommon to get slightly different results with the
two methods. In particular, the solver used by default on sparse matrices is
an iterative method, while the dense one uses a closed-form solution. If you
want control over which algorithm is used in each case, try explicitly
setting the 'solver' parameter in RidgeClassifier.fit (there is a rough
sketch of this at the bottom of this message).

There has been some work on speeding up the sparse RidgeClassifier [0], but
it isn't ready yet. In particular, conjugate gradient is known to converge
slowly for some matrices (I don't remember the exact criterion), which I
hope to accelerate using a preconditioner (see the second sketch at the
bottom). I'll probably update the pull request with these ideas in the
coming weeks.

Best,
Fabian

[0] https://github.com/scikit-learn/scikit-learn/pull/418

> Tests:
>
> I tested the full-feature scenario, a feature-selection scenario, and a
> subset-of-classes scenario; the difference occurs in all of them.
>
> Other classifiers that can operate on scipy.sparse were also tested, and
> none of them has this problem, namely kNN, Naive Bayes, LinearSVC and
> SGDClassifier.
>
> So I reckon this may be a bug in Ridge itself. Does anyone know which
> result, the sparse one or the toarray/todense one, is the correct one that
> I should take as my result?
>
> Another question, about how to use the tree classifier: in the experimental
> setting mentioned above, I get results, say f1 scores, around 83%-90% using
> the different kinds of classifiers mentioned above with parameter tuning.
> However, when I tried the tree classifier, my results were always below 65%.
>
> I tried to tune different parameters but never got a substantial
> improvement. I looked into a few textbooks and papers, but still could not
> figure out what I should do in practice to get results from the tree
> classifier comparable to the other classifiers.
>
> Could you please shed some light on using trees with high-dimensional data,
> or refer me to a practical guide on tree classifiers? Any help would be
> appreciated!
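P.S. In case it helps, below is a rough, untested sketch of the comparison I
have in mind: pin the same solver for the sparse and the dense input and
check whether the f1 scores line up. The toy data is made up, and the exact
place where 'solver' is accepted depends on your checkout (recent versions
take it in the RidgeClassifier constructor, as written below; the suggestion
above assumes the fit() signature of the current dev tree), so treat this as
a sketch rather than the definitive way to do it.

# Sketch: compare RidgeClassifier on the same data in sparse and dense form,
# forcing a single solver so that both runs use the same algorithm.
# The data is a random stand-in for the ~800 x ~2500 document-term matrix
# with 15 classes; module paths are those of a recent scikit-learn.
import numpy as np
from scipy import sparse
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

rng = np.random.RandomState(0)
X_sparse = sparse.rand(800, 2500, density=0.01, format='csr', random_state=rng)
y = rng.randint(0, 15, size=800)
train, test = np.arange(600), np.arange(600, 800)

scores = {}
for name, X in [('sparse', X_sparse), ('dense', X_sparse.toarray())]:
    # With solver='auto', sparse input goes through an iterative solver
    # (conjugate gradient) while dense input uses a direct solution, which
    # is where the small f1/precision/recall differences come from.
    clf = RidgeClassifier(alpha=1.0, solver='sparse_cg', tol=1e-6)
    clf.fit(X[train], y[train])
    scores[name] = f1_score(y[test], clf.predict(X[test]), average='macro')

print(scores)  # with the solver pinned, the two scores should agree closely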

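A second sketch, just to illustrate the preconditioning idea (this is not the
code from the pull request): solving the ridge normal equations
(X^T X + alpha * I) w = X^T y with scipy's conjugate gradient, once plainly
and once with a diagonal (Jacobi) preconditioner, counting iterations via the
callback. The names and data are made up for the example, and whether the
Jacobi preconditioner actually helps depends on the matrix.

# Sketch: plain vs. Jacobi-preconditioned conjugate gradient on the ridge
# normal equations, without ever forming X^T X explicitly.
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.RandomState(0)
X = sparse.rand(800, 2500, density=0.01, format='csr', random_state=rng)
y = rng.randn(800)
alpha = 1.0
n_features = X.shape[1]

# Matrix-vector product for A = X^T X + alpha * I.
A = LinearOperator((n_features, n_features),
                   matvec=lambda w: X.T.dot(X.dot(w)) + alpha * w,
                   dtype=np.float64)
b = X.T.dot(y)

# Jacobi preconditioner: apply the inverse of diag(A).
diag_A = np.asarray(X.multiply(X).sum(axis=0)).ravel() + alpha
M = LinearOperator((n_features, n_features),
                   matvec=lambda w: w / diag_A,
                   dtype=np.float64)

def n_cg_iterations(preconditioner):
    count = [0]
    def callback(_):
        count[0] += 1
    cg(A, b, M=preconditioner, callback=callback, maxiter=1000)
    return count[0]

print('plain CG iterations:  ', n_cg_iterations(None))
print('Jacobi CG iterations: ', n_cg_iterations(M))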