Hi Gilles,

Thank you so much for clearing this up for me. So, am I right in
thinking that the feature selection is carried out within every CV fold,
and that once the best parameters have been found, the whole pipeline is
refit on the full training set to produce the .best_estimator_?
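
Expressed in code, this is what I imagine is happening (just a sketch of
my understanding, assuming refit=True, which I believe is the default):

grid2 = RandomizedSearchCV(clf, param_distributions=param_dists,
                           cv=cv_model, n_iter=4, refit=True)

# each sampled parameter setting is scored per CV fold, with SelectKBest
# refitted inside every fold
grid2.fit(X_train, y_train)

# the winning pipeline is then refit on ALL of (X_train, y_train)
final_model = grid2.best_estimator_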

One final thing: I did manage to find out which of the predictors were
being chosen for the .best_estimator_, but it was not immediately
obvious how to do it. In the end, I isolated them as follows:

chosen_predictors = grid2.best_estimator_.steps[0][1].get_support()

This gave me a boolean array which, I presume, marks which columns of
the inputs were used in the final model.
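
In case it helps anyone else on the list, the same information can be
pulled out a little more readably via named_steps, and
get_support(indices=True) returns the column indices rather than a
boolean mask:

selector       = grid2.best_estimator_.named_steps['feature_selection']
chosen_columns = selector.get_support(indices=True)  # indices of kept columns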

Thanks again,

Tim


> Hi Tim,
> 
> On 9 February 2015 at 19:54, Timothy Vivian-Griffiths
> <vivian-griffith...@cardiff.ac.uk> wrote:
>> 
>> 
>> I have a dataset with a very large number of features (> 22,000), and it
>> taking very long to build and cross validate any models. What I would like
>> to do is to carry out the CV process, but adding in feature extraction at
>> every fold, before a model is built. I have read that it is important not to
>> carry out a feature extraction on the whole dataset first, as this is using
>> information from any samples that would go into the test set when deciding
>> which are the important predictors. I would also like to include a parameter
>> search using the RandomizedSearchCV function into this process as well.
>> 
>> The end result that I would like is an example of the best model that was
>> selected from the parameter search, together with a list of the predictors
>> that were used to build it. An example of some code that I have attempted is
>> as follows:
>> 
>> from __future__ import print_function
>> import numpy as np
>> from sklearn.cross_validation import train_test_split, StratifiedShuffleSplit
>> from sklearn.svm import SVC
>> from sklearn.grid_search import RandomizedSearchCV
>> from sklearn.feature_selection import SelectKBest, chi2
>> from sklearn.pipeline import Pipeline
>> from sklearn.metrics import roc_auc_score
>> from scipy.stats import expon, poisson
>> 
>> # a 7731 by 22567 matrix - too many dimensions
>> inputs      = np.load('combo_pgc2_parameters_p_05_allele_counts_phenotype_array.npy')
>> targets     = np.load('COMBO_phenotypes.npy')
>> 
>> X_train, X_test, y_train, y_test = train_test_split(inputs, targets)
>> 
>> cv_model    = StratifiedShuffleSplit(y_train, n_iter=4)
>> svm_poly    = SVC(kernel='poly')
>> selector    = SelectKBest(chi2, k=5000)
>> clf         = Pipeline([('feature_selection', selector),
>>                         ('classification', svm_poly)])
>> param_dists = {'classification__C': expon(scale=100),
>>                'classification__gamma': expon(scale=0.1),
>>                'classification__degree': poisson(mu=1, loc=1)}
>> 
>> if __name__=='__main__':
>>    grid        = RandomizedSearchCV(clf, param_distributions=param_dists,
>>                                     cv=cv_model, n_iter=4, n_jobs=-1)
>>    grid.fit(X_train, y_train)
>>    y_pred = grid.predict(X_test)
>>    score = roc_auc_score(y_test, y_pred)
>>    print(score)
>> 
>> 
>> While this is the best that I can do at the moment, I'm not sure that it
>> is doing exactly what I want. Is it actually breaking the data down into
>> the folds and then carrying out the feature selection within each fold, or
>> is it carrying out the feature selection on the whole set first and then
>> proceeding with the cross-validation?
> 
> Your pipeline is correct. For each train/test pair of folds yielded by
> StratifiedShuffleSplit, RandomizedSearchCV will fit your pipeline (which
> you should see as an atomic procedure) on the train fold and then
> evaluate it on the test fold. In each such fit, k=5000 features will be
> selected and then used to train an SVC estimator. The combination of
> hyper-parameters that performs best on average over all test folds will
> then be selected.
> 
>> And
>> finally, if this is indeed doing what I want it to, how do I find out which
>> of the predictors it has chosen to be included in the final model with the
>> best parameters?
> 
> The features that are selected may be different from one train/test
> pair to another. The ones to report are those obtained when you
> retrain your pipeline, with the best hyper-parameters, on the whole
> training set.
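> 
> For example, something along these lines should work (just a sketch;
> it assumes the fitted search object is called grid, and clone comes
> from sklearn.base):
> 
> best = clone(clf).set_params(**grid.best_params_)
> best.fit(X_train, y_train)
> mask = best.named_steps['feature_selection'].get_support()
> 
> Note that with refit=True (the default), grid.best_estimator_ is
> already such a refitted pipeline, so you can read the mask off it
> directly.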
> 
>> 
>> Many thanks,
>> 
>> Tim V-G