Thanks. I completely forgot to follow up on this. The tip by Michael worked 
perfectly:

> Forget this comment, it actually works, because RFE itself also does the '__' 
> thing. You need to use 'sel__estimator__C' instead of 'sel__SVC__C'
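The '__' naming can be verified without running a full grid search: a Pipeline's get_params() lists every grid-searchable parameter name, and RFE prefixes its wrapped estimator's parameters with 'estimator__'. A minimal sketch (step name 'sel' is just an illustrative choice):

```python
# Minimal sketch: inspecting the '__'-composed parameter names of a
# pipeline containing RFE, to confirm 'sel__estimator__C' is the right key.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

pipe = Pipeline([('sel', RFE(estimator=SVC(kernel='linear')))])

# get_params() returns a dict of every settable parameter; RFE exposes its
# inner estimator's params under 'estimator__', so the full pipeline path
# is 'sel__estimator__C' (not 'sel__SVC__C').
params = pipe.get_params()
print('sel__estimator__C' in params)
print('sel__n_features_to_select' in params)
```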

Just ran a quick test on a simple toy dataset (Wine from UCI): I grid-searched 
over the SVM parameters and over selecting 1-4 features via SelectKBest vs. 
RFE. The two approaches selected different feature subsets (though both ended 
up choosing 4 features).

SelectKBest -> avg. CV accuracy: 0.94
RFE workaround -> avg. CV accuracy: 0.96

Maybe it would be worthwhile to add an example to the docs showing how to do 
recursive feature elimination within GridSearchCV?

I should mention that Wine is probably not the best dataset for this: all of 
its features are somewhat informative, so the best feature subset is likely 
k = d, and a grid search over the number of features will probably always 
pick the largest value.


Btw., the code was:


### RFE approach

# imports (scikit-learn 0.15.x-era API; X, y hold the Wine data)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

pipeline = Pipeline([('scl', StandardScaler()),
                     ('sel', RFE(estimator=SVC(kernel='linear',
                                               random_state=1),
                                 step=1))])

param_grid = [{'sel__n_features_to_select': [1, 2, 3, 4], 
               'sel__estimator__C': [0.1, 1.0, 10.0, 100.0], 
               'sel__estimator__kernel': ['linear']}]

grid_search = GridSearchCV(pipeline, 
                           param_grid=param_grid, 
                           verbose=1, 
                           cv=StratifiedKFold(y, n_folds=10), 
                           scoring='accuracy', 
                           n_jobs=1)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
print(grid_search.best_score_)



### SelectKBest approach

# imports (scikit-learn 0.15.x-era API; X, y hold the Wine data)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

pipeline = Pipeline([('scl', StandardScaler()),
                     ('sel', SelectKBest()), 
                     ('clf', SVC(kernel='linear', random_state=1))])

param_grid = [{'sel__k': [1, 2, 3, 4], 
               'clf__C': [0.1, 1, 10, 100], 
               'clf__kernel': ['linear']}]

grid_search = GridSearchCV(pipeline, 
                           param_grid=param_grid, 
                           verbose=1, 
                           cv=StratifiedKFold(y, n_folds=10), 
                           scoring='accuracy', 
                           n_jobs=1)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
print(grid_search.best_score_)
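To see that the two selectors really pick different subsets, both expose a 
get_support() boolean mask after fitting. A small self-contained sketch 
(using a synthetic make_classification dataset instead of Wine, purely for 
illustration):

```python
# Hedged sketch: fit RFE and SelectKBest directly (no grid search) and
# compare which 4 of 8 synthetic features each one keeps.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=1)

rfe = RFE(estimator=SVC(kernel='linear'), n_features_to_select=4).fit(X, y)
kbest = SelectKBest(f_classif, k=4).fit(X, y)

# get_support() returns a boolean mask over the input features;
# the two masks need not agree, even for the same k.
print(rfe.get_support())
print(kbest.get_support())
```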



Best,
Sebastian


> On Feb 17, 2015, at 6:54 PM, Andy <t3k...@gmail.com> wrote:
> 
> 
> On 02/13/2015 01:04 AM, Sebastian Raschka wrote:
>> 
>> Both in terms of speed and performance, I think it depends on what 
>> SelectKBest is doing :).
> It uses a simple anova test, which is independent for each feature. It 
> does not build any kind of model at all, so it is very cheap.
> 
> ------------------------------------------------------------------------------
> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
> with Interactivity, Sharing, Native Excel Exports, App Integration & more
> Get technology previously reserved for billion-dollar corporations, FREE
> http://pubads.g.doubleclick.net/gampad/clk?id=190641631&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


