On 04.12.2012 12:20, Olivier Grisel wrote:
> 2012/12/4 Philipp Singer <[email protected]>:
>>> It's probably better to train a linear classifier on the text features
>>> alone and a second (potentially non-linear) classifier such as GBRT or
>>> ExtraTrees on the predict_proba outcome of the text classifier + your
>>> additional low dim features.
>>>
>>> This is some kind of stacking method (a sort of ensemble method). It
>>> should make the text features not overwhelm the final classifier if
>>> the other features are informative.
>> Hey Olivier!
>>
>> Thanks for the hints. I just tried it, but unfortunately the results are
>> much worse than just using my textual features alone.
>>
>> Just to be sure that I am doing it right:
>>
>> At first I create my textual features using a vectorizer. Then I fit a
>> linear SVC on these features (training data ofc) and use predict_proba
>> for my training samples again, resulting in a probability distribution of
>> dimension 7 (I have 7 classes).
>>
>> Then I append my additional features (those are 15) and fit another
>> classifier on the new data. (I tried several scaling/normalizing ideas
>> without improvement.)
>>
>> I do the same procedure for test data. (Btw I do cross val.)
>>
>> While I get 0.85 f1 score for just using textual data, the combined
>> approach results in only 0.4.
> Have you scaled your additional features to the [0-1] range as the
> probability features from the text classifier?
>
> If you do a full grid search of the SVC hyperparameters (e.g. kernel
> linear or rbf and C + gamma for RBF only) there is no reason that the
> stacked model could be worse than the original text classifier (unless
> you have very few samples and the additional features are pure
> noise).

Can't the stacked model be worse because of overfitting issues?
I guess if you include a linear SVM, it might be able to learn the identity
and be as good as the original classifier. With only RBF-SVM, I'm not sure
this is possible.
But testing just a linear SVM should definitely not make things worse if
the grid search is done correctly.

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
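
For reference, here is a minimal sketch of the stacking setup discussed in this thread. Everything specific in it is an assumption rather than the original experiment: the data is synthetic (make_classification with 7 classes, split into a pretend "text" block and 15 extra columns), LogisticRegression stands in for the linear text classifier because LinearSVC does not expose predict_proba, and the parameter grid is only illustrative. The sketch also builds the training-set probabilities out-of-fold with cross_val_predict instead of predicting on the rows the text classifier was fitted on, since in-sample probabilities tend to be overconfident and are one plausible source of the overfitting mentioned above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, cross_val_predict, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 7 classes, 40 columns in total.
X, y = make_classification(n_samples=700, n_features=40, n_informative=20,
                           n_classes=7, n_clusters_per_class=1, random_state=0)
X_text, X_extra = X[:, :25], X[:, 25:]  # pretend "text" block + 15 extra features

text_train, text_test, extra_train, extra_test, y_train, y_test = train_test_split(
    X_text, X_extra, y, test_size=0.25, stratify=y, random_state=0)

# Level 1: linear classifier on the "text" block only.
text_clf = LogisticRegression(max_iter=1000)

# Out-of-fold class probabilities for the training rows: each row is scored
# by a model that did not see it during fitting.
proba_train = cross_val_predict(text_clf, text_train, y_train,
                                cv=5, method="predict_proba")

# Refit on all training rows to produce the test-time probabilities.
text_clf.fit(text_train, y_train)
proba_test = text_clf.predict_proba(text_test)

# Scale the extra features to [0, 1] using training statistics only
# (test values can fall slightly outside that range).
scaler = MinMaxScaler().fit(extra_train)
stacked_train = np.hstack([proba_train, scaler.transform(extra_train)])
stacked_test = np.hstack([proba_test, scaler.transform(extra_test)])

# Level 2: grid search over linear and RBF kernels (illustrative grid).
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
]
search = GridSearchCV(SVC(), param_grid, scoring="f1_macro", cv=3)
search.fit(stacked_train, y_train)

f1_text = f1_score(y_test, text_clf.predict(text_test), average="macro")
f1_stacked = f1_score(y_test, search.predict(stacked_test), average="macro")
print("text only: %.3f  stacked: %.3f  best: %s"
      % (f1_text, f1_stacked, search.best_params_))
```

Including the linear kernel in the grid is what gives the second stage a chance to simply pass the probability features through; whether the stacked score actually matches or beats the text-only score depends on the data, so the numbers printed here say nothing about the original problem.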
