I agree with Andreas.
In my experience, a large number of features typically isn't a big problem for 
random forests either; however, it of course depends on the number of trees 
and training samples.

If you suspect that overfitting might be a problem with unregularized 
classifiers, also consider "dimensionality reduction"/"feature extraction" 
techniques to compress the feature space, e.g., linear or kernel PCA, or other 
methods listed in the manifold learning section of the scikit-learn website.
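
Just as a rough, untested sketch of what that could look like (the number of 
components below is only a placeholder you'd want to tune, e.g., via 
cross-validation, and X_train/X_test/y_train are assumed to be your existing 
splits):

    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier

    # compress the 2048 binary features into a smaller dense space
    pca = PCA(n_components=50)  # 50 is just a placeholder; tune it (e.g., grid search)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)  # reuse the projection fitted on the training data

    clf = RandomForestClassifier(n_estimators=200).fit(X_train_pca, y_train)
    predicted = clf.predict(X_test_pca)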

However, there are scenarios where you'd want to keep the "original" features 
(in contrast to, e.g., principal components), and there are scenarios where 
linear methods such as LinearSVC(penalty='l1') may not work so well (e.g., for 
non-linear problems). The optimal solution would be to exhaustively test all 
feature combinations to see which works best; however, this can be quite 
costly. For demonstration purposes, I implemented "sequential backward 
selection" 
(http://rasbt.github.io/mlxtend/docs/sklearn/sequential_backward_selection/) 
some time ago, a simple greedy alternative to the exhaustive search; maybe you 
are lucky and it works well in your case (rough sketch of the idea below). 
When I find time after my summer projects, I am planning to implement some 
genetic algorithms for feature selection...
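
To illustrate the idea (this is not the mlxtend implementation itself, just a 
plain, unoptimized greedy loop using scikit-learn's cross_val_score; the 
estimator and k_features below are placeholders, and X_train is assumed to be 
a NumPy array):

    from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions
    from sklearn.ensemble import RandomForestClassifier

    def sequential_backward_selection(X, y, estimator, k_features):
        """Greedily drop the feature whose removal hurts CV accuracy the least."""
        features = list(range(X.shape[1]))
        while len(features) > k_features:
            results = []
            for f in features:
                candidate = [i for i in features if i != f]
                score = cross_val_score(estimator, X[:, candidate], y, cv=3).mean()
                results.append((score, f))
            best_score, worst_feature = max(results)
            features.remove(worst_feature)
        return features

    est = RandomForestClassifier(n_estimators=50)
    selected = sequential_backward_selection(X_train, y_train, est, k_features=100)
    X_train_reduced = X_train[:, selected]
    X_test_reduced = X_test[:, selected]

With 2048 features even this greedy version will take quite a while, though, 
so you may want to start from a pre-filtered subset.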

Best,
Sebastian


> On May 28, 2015, at 11:59 AM, Andreas Mueller <t3k...@gmail.com> wrote:
> 
> Hi Herbert.
> 1) Often, reducing the feature space does not help with accuracy, and using a 
> regularized classifier leads to better results.
> 2) To do feature selection, you need two methods: one to reduce the set of 
> features, another that does the actual supervised task (classification here).
> 
> Have you tried just using the standard classifiers? Clearly you tried the RF, 
> but I'd also try a linear method like LinearSVC/LogisticRegression or a 
> kernel SVC.
> 
> If you want to do feature selection, what you need to do is something like 
> this:
> 
> feature_selector = LinearSVC(penalty='l1', dual=False)  # penalty='l1' needs dual=False; or maybe start with SelectKBest()
> feature_selector.fit(X_train, y_train)
> 
> X_train_reduced = feature_selector.transform(X_train)
> X_test_reduced = feature_selector.transform(X_test)
> 
> classifier = RandomForestClassifier().fit(X_train_reduced, y_train)
> 
> prediction = classifier.predict(X_test_reduced)
> 
> 
> Or you use a pipeline, as here: 
> http://scikit-learn.org/dev/auto_examples/feature_selection/feature_selection_pipeline.html
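> Roughly something like this (untested sketch; SelectKBest with chi2 is just 
> one possible selector for your non-negative 0/1 features, and k is a 
> placeholder to tune):
> 
> from sklearn.pipeline import Pipeline
> from sklearn.feature_selection import SelectKBest, chi2
> from sklearn.ensemble import RandomForestClassifier
> 
> pipe = Pipeline([('select', SelectKBest(chi2, k=200)),
>                  ('clf', RandomForestClassifier(n_estimators=200))])
> pipe.fit(X_train, y_train)  # feature selection is fit on the training data only
> prediction = pipe.predict(X_test)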
> Maybe we should add a version without the pipeline to the examples?
> 
> Cheers,
> Andy
> 
> 
> 
> On 05/28/2015 08:32 AM, Herbert Schulz wrote:
>> Hello,
>> I'm using scikit-learn for machine learning.
>> I have 800 samples with 2048 features, therefore I want to reduce my 
>> features in the hope of getting better accuracy. 
>> 
>> It is a multiclass problem (classes 0-5), and the features consist of 1's and 
>> 0's:  [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0....,0]
>> 
>> I'm using the Random Forest Classifier.
>> 
>> Should I just feature-select the training data? And is it enough if I use 
>> this code:
>> 
>>     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)
>> 
>>     clf = RandomForestClassifier(n_estimators=200, warm_start=True,
>>                                  criterion='gini', max_depth=13)
>>     clf.fit(X_train, y_train).transform(X_train)
>> 
>>     predicted = clf.predict(X_test)
>>     expected = y_test
>>     confusionMatrix = metrics.confusion_matrix(expected, predicted)
>> 
>> The accuracy didn't get any higher so far. Is everything OK in the code, or 
>> am I doing something wrong?
>> 
>> I'll be very grateful for your help.
>> 
>> 
>> 
>> 
>> 
> 
