On Fri, Feb 13, 2015 at 10:04 AM, Sebastian Raschka <se.rasc...@gmail.com>
wrote:
> > It has its own scoring function.
>
> Is this documented somewhere? I only found "Select features according to
> the k highest scores." (at
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)
> which could maybe be extended a little bit.
>
SelectKBest is no more specific than its docstring indicates: it
requires a function that assigns a score to each feature and then keeps
the k features with the highest scores. By default this function is
f_classif (
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html
), i.e. an ANOVA F-test, but in general it is better to pass the score
function explicitly so you know how the selection actually takes place.
If you have a regression target, for example, you need to switch to
f_regression.
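For illustration, a minimal sketch of univariate selection with an
explicit score function (hypothetical toy data, where only feature 0
carries the class signal):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data (hypothetical): 100 samples, 5 features; the class label
# depends almost entirely on feature 0.
rng = np.random.RandomState(1)
X = rng.randn(100, 5)
y = (X[:, 0] + 0.1 * rng.randn(100) > 0).astype(int)

# Pass the score function explicitly rather than relying on the default.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(X_new.shape)             # (100, 2)
print(selector.get_support())  # boolean mask over the 5 input features
```

Each feature is scored independently of the others and of any
downstream classifier; the k highest-scoring features are kept.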
>
> >> Alternatively, I would use recursive feature selection like so:
> >
> > You could. It would be much slower, and I am not convinced it would work
> > better.
>
> In terms of both speed and performance, I think it depends on what
> SelectKBest is doing :).
See above: univariate selection according to whichever score function you provide.
> I think a greedy backward selection would have the advantage that the
> "best" features are selected with respect to the classifier performance.
> This could make a significant difference e.g., for non-linear data
> depending on how SelectKBest works.
>
> I was conceptually thinking of something like this
>
>
> pipeline = Pipeline([
>     ('scl', StandardScaler()),
>     ('sel', RFE(estimator=SVC(kernel='linear', random_state=1),
>                 step=1))])
>
> param_grid = [{'sel__n_features_to_select': [1, 2, 3, 4],
>                'sel__SVC__C': [0.1, 1, 10, 100],
>                'sel__SVC__kernel': ['linear']}]
>
> grid_search = GridSearchCV(pipeline,
>                            param_grid=param_grid,
>                            verbose=1,
>                            cv=StratifiedKFold(y, n_folds=10),
>                            scoring='accuracy',
>                            n_jobs=1)
>
> grid_search.fit(X, y)
> print(grid_search.best_estimator_)
> print(grid_search.best_score_)
>
> Btw., is there a general way to pass a parameter to a nested estimator
> in a pipeline like that (e.g., here to the SVC)?
>
>
AFAIK, and according to this line
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py#L116
the pipeline splits the parameter name exactly once, at the first
occurrence of '__'. So in its current state this type of recursion is
impossible.
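For what it's worth, a hedged sketch of how the '__' convention can be
chained when set_params delegates to sub-estimators recursively. This
assumes RFE exposes its wrapped estimator under the parameter name
'estimator' and that your scikit-learn version supports this nesting:

```python
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scl', StandardScaler()),
                 ('sel', RFE(estimator=SVC(kernel='linear'), step=1))])

# set_params splits the name at the first '__' and hands the remainder
# to the named step, which splits again -- so the path
# pipeline -> RFE -> SVC would be spelled 'sel__estimator__C'.
pipe.set_params(sel__n_features_to_select=2, sel__estimator__C=10)
print(pipe.get_params()['sel__estimator__C'])  # 10
```

Under that assumption the grid-search keys would be
'sel__n_features_to_select' and 'sel__estimator__C' rather than
'sel__SVC__C'.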
Hope that helps,
Michael
>
> Best,
> Sebastian
>
>
> > On Feb 13, 2015, at 3:34 AM, Gael Varoquaux <
> gael.varoqu...@normalesup.org> wrote:
> >
> > On Fri, Feb 13, 2015 at 03:31:54AM -0500, Sebastian Raschka wrote:
> >> I am wondering how SelectKBest determines what the "best" set of
> >> features is since it happens before they are fed to the classifier.
> >> Does it have its own "scoring" function or does it use the classifier
> >> from the last fit?
> >
> > It has its own scoring function.
> >
> >> Alternatively, I would use recursive feature selection like so:
> >
> > You could. It would be much slower, and I am not convinced it would work
> > better. You could try both, and tell us your experience :).
> >
> > Cheers,
> >
> > Gaƫl
> >
> >
> ------------------------------------------------------------------------------
> > Dive into the World of Parallel Programming. The Go Parallel Website,
> > sponsored by Intel and developed in partnership with Slashdot Media, is
> your
> > hub for all things parallel software development, from weekly thought
> > leadership blogs to news, videos, case studies, tutorials and more. Take
> a
> > look and join the conversation now. http://goparallel.sourceforge.net/
> > _______________________________________________
> > Scikit-learn-general mailing list
> > Scikit-learn-general@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general