Thanks. I completely forgot to follow up on this. The tip by Michael worked
perfectly:
> Forget this comment, it actually works, because RFE itself also does the '__'
> thing. You need to use 'sel__estimator__C' instead of 'sel__SVC__C'
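To make the naming rule concrete: the double-underscore path is built from the Pipeline step name ('sel'), then RFE's constructor argument name ('estimator'), then the SVC parameter ('C') -- the class name 'SVC' never appears. A minimal check (step names are my own, matching the snippet below):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

pipe = Pipeline([('scl', StandardScaler()),
                 ('sel', RFE(estimator=SVC(kernel='linear')))])

# Parameter keys nest as step__constructor_arg__param:
params = pipe.get_params()
print('sel__estimator__C' in params)   # True
print('sel__SVC__C' in params)         # False
```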
Just ran a quick test on a simple toy dataset (Wine from UCI). Here, I
grid-searched over SVM parameters and 1-4 features via SelectKBest vs. RFE. The
two approaches yielded different feature subsets (though both selected 4 features).
SelectKBest -> avg. ROC accuracy: 0.94
RFE workaround -> avg. ROC accuracy: 0.96
Maybe it would be worthwhile to add an example to the docs showing how to do
recursive feature elimination inside a grid search?
I have to mention that Wine is probably not the best dataset for this: all
features are somewhat informative, so the best feature subset tends to be
k = d, and a grid search over the number of features would probably always
favor the largest value.
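For anyone who wants to reproduce this: I pulled Wine from UCI, but scikit-learn also bundles a copy of the same data (this assumes a version that provides load_wine; the exact loading step is otherwise up to you):

```python
from sklearn.datasets import load_wine

# 178 samples, 13 features, 3 cultivar classes -- same data as the UCI Wine set.
X, y = load_wine(return_X_y=True)
print(X.shape)  # (178, 13)
```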
Btw., the code would be:
### RFE approach
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

# X, y: Wine features and labels, loaded beforehand
pipeline = Pipeline([('scl', StandardScaler()),
                     ('sel', RFE(estimator=SVC(kernel='linear',
                                               random_state=1),
                                 step=1))])

# RFE re-exposes its wrapped estimator's params under 'estimator__'
param_grid = [{'sel__n_features_to_select': [1, 2, 3, 4],
               'sel__estimator__C': [0.1, 1.0, 10.0, 100.0],
               'sel__estimator__kernel': ['linear']}]

grid_search = GridSearchCV(pipeline,
                           param_grid=param_grid,
                           verbose=1,
                           cv=StratifiedKFold(y, n_folds=10),
                           scoring='accuracy',
                           n_jobs=1)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
print(grid_search.best_score_)
### SelectKBest approach
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold

pipeline = Pipeline([('scl', StandardScaler()),
                     ('sel', SelectKBest()),
                     ('clf', SVC(kernel='linear', random_state=1))])

param_grid = [{'sel__k': [1, 2, 3, 4],
               'clf__C': [0.1, 1, 10, 100],
               'clf__kernel': ['linear']}]

grid_search = GridSearchCV(pipeline,
                           param_grid=param_grid,
                           verbose=1,
                           cv=StratifiedKFold(y, n_folds=10),
                           scoring='accuracy',
                           n_jobs=1)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
print(grid_search.best_score_)
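Note that the snippets above use the pre-0.18 API (sklearn.grid_search, StratifiedKFold(y, n_folds=...)). For anyone on a newer scikit-learn, the same SelectKBest search looks like this -- a sketch with module paths updated and the CV splitter taking n_splits instead of the labels:

```python
from sklearn.datasets import load_wine
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([('scl', StandardScaler()),
                     ('sel', SelectKBest()),
                     ('clf', SVC(kernel='linear', random_state=1))])

param_grid = {'sel__k': [1, 2, 3, 4],
              'clf__C': [0.1, 1, 10, 100]}

# In the newer API the splitter is configured up front and
# GridSearchCV passes it y during fit.
grid_search = GridSearchCV(pipeline,
                           param_grid=param_grid,
                           cv=StratifiedKFold(n_splits=10),
                           scoring='accuracy')
grid_search.fit(X, y)
print(grid_search.best_params_)
print(grid_search.best_score_)
```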
Best,
Sebastian
> On Feb 17, 2015, at 6:54 PM, Andy <[email protected]> wrote:
>
>
> On 02/13/2015 01:04 AM, Sebastian Raschka wrote:
>>
>> In terms of both speed and performance, I think it depends on what
>> SelectKBest is doing :).
> It uses a simple anova test, which is independent for each feature. It
> does not build any kind of model at all, so it is very cheap.
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general