Re: [Scikit-learn-general] Append additional data in pipeline

Andreas Mueller Tue, 04 Dec 2012 03:26:46 -0800

On 04.12.2012 12:20, Olivier Grisel wrote:
> 2012/12/4 Philipp Singer <[email protected]>:
>>> It's probably better to train a linear classifier on the text features
>>> alone and a second (potentially non linear classifier such as GBRT or
>>> ExtraTrees) on the predict_proba outcome of the text classifier + your
>>> additional low dim features.
>>>
>>> This is some kind of stacking method (a sort of ensemble method). It
>>> should make the text features not overwhelm the final classifier if
>>> the other features are informative.
>> Hey Olivier!
>>
>> Thanks for the hints. I just tried it, but unfortunately the results are
>> much worse than just using my textual features alone.
>>
>> Just to be sure I am doing it right:
>>
>> At first I create my textual features using a vectorizer. Then I fit a
>> linear SVC on these features (training data ofc) and use predict_proba
>> for my training samples again resulting in a probability distribution of
>> dimension 7 (I have 7 classes).
>>
>> Then I append my additional features (those are 15) and fit another
>> classifier on the new data. (I tried several scaling/normalizing ideas
>> without improvement)
>>
>> I do the same procedure for test data. (Btw I do cross val)
>>
>> While I get 0.85 f1 score for just using textual data the combined
>> approach results in only 0.4.
> Have you scaled your additional features to the [0, 1] range, like the
> probability features from the text classifier?
>
> If you do a full grid search of the SVC hyperparameters (e.g. kernel
> linear or rbf, and C + gamma for RBF only), there is no reason that the
> stacked model could be worse than the original text classifier (unless
> you have very few samples and the additional features are pure
> noise).
Can't the stacked model be worse because of overfitting issues?
I guess if you include a linear SVM, it might be able to learn the identity
and be as good as the original classifier. With only RBF-SVM,
I'm not sure this is possible.
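
To make the "learn the identity" point concrete, here is a minimal NumPy
sketch. The data is random and purely hypothetical (7 class probabilities
plus 15 extra features, matching the sizes mentioned in the thread), and
the weight matrix is written by hand rather than learned. It only shows
that a linear model over the stacked features can, in principle, reproduce
the text classifier's predictions exactly by ignoring the extra columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classes, n_extra = 200, 7, 15

# Hypothetical class probabilities from the text classifier (rows sum to 1).
proba = rng.dirichlet(np.ones(n_classes), size=n_samples)
# Hypothetical additional low-dimensional features.
extra = rng.normal(size=(n_samples, n_extra))
stacked = np.hstack([proba, extra])

# One weight row per class: identity on the probability columns,
# zeros on the extra-feature columns.
W = np.hstack([np.eye(n_classes), np.zeros((n_classes, n_extra))])
scores = stacked @ W.T

# The linear decision rule on the stacked features picks the same class
# as taking the argmax of the text classifier's probabilities.
text_only_pred = proba.argmax(axis=1)
stacked_pred = scores.argmax(axis=1)
identical = bool(np.array_equal(text_only_pred, stacked_pred))
print(identical)
```

Whether a regularized linear SVM actually finds such a solution depends on
C and on how the features are scaled, which is why the grid search matters.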


But testing just a linear SVM should definitely not make things worse
if the grid search is done correctly.
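
For reference, the stacking setup discussed in this thread could be
sketched roughly as follows. Everything here is an assumption made for
illustration: the data is synthetic (40 columns stand in for the text
features, 5 for the additional features), LogisticRegression stands in for
the text classifier because it provides predict_proba directly, and the
module paths are those of a recent scikit-learn release. The one deliberate
choice is that the probability features for the training set are computed
out-of-fold with cross_val_predict instead of calling predict_proba on the
same samples the text classifier was fitted on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import (GridSearchCV, cross_val_predict,
                                     train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the poster's data set).
X, y = make_classification(n_samples=600, n_features=45, n_informative=12,
                           n_classes=3, random_state=0)
X_text, X_extra = X[:, :40], X[:, 40:]

idx_train, idx_test = train_test_split(
    np.arange(len(y)), test_size=0.25, stratify=y, random_state=0)
y_train, y_test = y[idx_train], y[idx_test]

# Level 1: text classifier. Training-set probabilities are out-of-fold,
# so the second stage never sees predictions made on fitted samples.
text_clf = LogisticRegression(max_iter=1000)
oof_proba = cross_val_predict(text_clf, X_text[idx_train], y_train,
                              cv=5, method="predict_proba")
text_clf.fit(X_text[idx_train], y_train)
test_proba = text_clf.predict_proba(X_text[idx_test])

# Level 2 input: class probabilities + additional features.
stacked_train = np.hstack([oof_proba, X_extra[idx_train]])
stacked_test = np.hstack([test_proba, X_extra[idx_test]])

# Scale everything to [0, 1] and grid-search both kernels
# (gamma only applies to the RBF kernel).
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10],
     "svc__gamma": [0.01, 0.1, 1]},
]
search = GridSearchCV(make_pipeline(MinMaxScaler(), SVC()),
                      param_grid, cv=3, scoring="f1_macro")
search.fit(stacked_train, y_train)

test_f1 = f1_score(y_test, search.predict(stacked_test), average="macro")
print(search.best_params_, round(test_f1, 3))
```

If the training-set probabilities are instead produced by the already
fitted text classifier, they tend to look far more confident than the ones
computed for unseen test samples, and the second classifier can overfit to
that mismatch. That is one possible source of a drop in score, though not
necessarily the cause in this particular case.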

