Hi all,

I want to include a feature selector in a pipeline that I am feeding to 
GridSearchCV, and I am wondering whether I am doing the right thing here. 
Technically it works, but I want to make sure I understand the 
implementation correctly.

For example, when I use SelectKBest like so:

################################################
# Example 1

from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
X = iris.data
y = iris.target

pipeline = Pipeline([('scl', StandardScaler()),
                     ('sel', SelectKBest()), 
                     ('clf', SVC(kernel='linear', random_state=1))])

param_grid = [{'sel__k': [1, 2, 3, 4], 
               'clf__C': [0.1, 1, 10, 100], 
               'clf__kernel': ['linear']}]

grid_search = GridSearchCV(pipeline, 
                           param_grid=param_grid, 
                           verbose=1, 
                           cv=StratifiedKFold(n_splits=10), 
                           scoring='accuracy', 
                           n_jobs=1)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
print(grid_search.best_score_)

# end example
################################################


I am wondering how SelectKBest determines what the "best" set of features is, 
since the selection happens before the features are fed to the classifier. 
Does it have its own "scoring" function, or does it use the classifier from 
the last fit?

Alternatively, I could use recursive feature elimination like so:

################################################
# Example 2

from sklearn.feature_selection import RFE

svm = SVC(kernel='linear', random_state=1)
param_grid = [{'n_features_to_select': [1, 2, 3, 4]}]
rfe = RFE(estimator=svm, step=1)

grid_search = GridSearchCV(rfe, 
                           param_grid=param_grid, 
                           verbose=1, 
                           cv=StratifiedKFold(n_splits=10), 
                           scoring='accuracy', 
                           n_jobs=1)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
print(grid_search.best_score_)

# end example
################################################

From my understanding, the approaches differ in that Example 2 performs a 
greedy search instead of an exhaustive one (like in SelectKBest), is this 
correct? Also, in Example 2, I am fitting it to an "untuned" SVM. In the 
linear case it might not make a huge difference, but say I am using an RBF 
kernel, is there a way to combine the feature selection with hyperparameter 
tuning via grid search? Or is Example 1 already the "right" approach to do it?
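
One way I can imagine combining them (just a sketch of my own idea with
illustrative parameter values, not something I have verified as the
recommended approach): wrap RFE inside the pipeline, so that GridSearchCV
tunes the number of selected features and the classifier's hyperparameters
jointly. Since RFE needs an estimator that exposes coef_ or
feature_importances_, I let a linear SVM do the ranking while an RBF-kernel
SVC does the final classification:

################################################
# Sketch: RFE + hyperparameter tuning in one grid search

from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

pipeline = Pipeline([('scl', StandardScaler()),
                     # linear SVM supplies coef_ for the feature ranking
                     ('sel', RFE(estimator=SVC(kernel='linear',
                                               random_state=1), step=1)),
                     # RBF-kernel SVC does the actual classification
                     ('clf', SVC(kernel='rbf', random_state=1))])

param_grid = [{'sel__n_features_to_select': [1, 2, 3, 4],
               'clf__C': [0.1, 1, 10, 100],
               'clf__gamma': [0.01, 0.1, 1]}]

grid_search = GridSearchCV(pipeline,
                           param_grid=param_grid,
                           cv=StratifiedKFold(n_splits=10),
                           scoring='accuracy',
                           n_jobs=1)
grid_search.fit(X, y)
print(grid_search.best_params_)
print(grid_search.best_score_)

# end sketch
################################################

Would that be a sensible way to do it?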

Thanks,
Sebastian



