Hi Paul
Tuning min_samples_split is a good idea, but it is not related to the
class imbalance.
First, you should specify what you want to optimize. Accuracy is usually
not a good measure for imbalanced classes; the F-score is often a better
choice.
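For example, a minimal sketch of scoring with the F-score instead of
accuracy (this assumes the y_test / y_predict variables from your snippet
below and uses sklearn.metrics):
"
from sklearn import metrics

# the F-score balances precision and recall, so it is more informative
# than plain accuracy when one class dominates
print(metrics.f1_score(y_test, y_predict))
print(metrics.classification_report(y_test, y_predict))
"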

If you want your confusion matrix to be more balanced, you can try two
things (as class weights are not implemented yet, AFAIK):
- Set a different decision threshold: for example, classify every sample
  whose predicted probability of being positive is above 0.2 as positive
  (see the first sketch below).
- Balance the dataset, i.e. make it such that both classes contribute the
  same number of samples. You can do that by either throwing away samples
  from the larger class (undersampling) or duplicating samples from the
  smaller class (oversampling); see the second sketch below.
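
First sketch, the decision threshold (assumes a fitted
RandomForestClassifier clf_RF and the X_test from your code; 0.2 is just
an example cut-off):
"
import numpy as np

# probability of the positive class ("1") for each test sample
proba_pos = clf_RF.predict_proba(X_test)[:, 1]

# predict "1" whenever that probability exceeds the chosen threshold
threshold = 0.2
y_predict = (proba_pos > threshold).astype(int)
"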
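
Second sketch, random under-/oversampling with plain numpy index juggling
(assumes numeric X_train / y_train arrays as in your code; nothing
scikit-learn specific here):
"
import numpy as np

pos = np.where(y_train == 1)[0]   # minority class indices
neg = np.where(y_train == 0)[0]   # majority class indices

# undersampling: keep a random subset of the majority class that is
# the same size as the minority class
keep = np.random.permutation(neg)[:len(pos)]
idx = np.concatenate([pos, keep])

# oversampling would instead duplicate minority samples (with replacement):
# extra = pos[np.random.randint(0, len(pos), size=len(neg))]
# idx = np.concatenate([neg, extra])

np.random.shuffle(idx)
X_bal, y_bal = X_train[idx], y_train[idx]
# then fit the forest on X_bal, y_bal instead of X_train, y_train
"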

hth,
Andy


On 11/07/2012 08:18 AM, [email protected] wrote:
> Dear SciKitters,
>
> given a dataset of 622 samples with 177 features each, I want to classify
> them according to an experimental classification of "0" or "1".
>
> After splitting into training and test sets, I trained a RandomForest the
> following way:
> "
> from sklearn import metrics
> from sklearn.ensemble import RandomForestClassifier
>
> clf_RF = RandomForestClassifier(n_estimators=20, max_depth=None,
>                                 random_state=0, n_jobs=1)
> clf_RF = clf_RF.fit(X_train, y_train)
> y_predict = clf_RF.predict(X_test)
> accuracy = clf_RF.score(X_test, y_test)
> fpr, tpr, thresholds = metrics.roc_curve(y_test, y_predict)
> print metrics.confusion_matrix(y_test, y_predict), "\n", accuracy, "\n", metrics.auc(fpr, tpr)
> "
> which gives
> "
> [[161  12]
>  [ 51  25]]
> 0.746987951807
> 0.629791603286
> "
>
> Yes, this data set is rather unbalanced, and I was told to tune
> min_samples_split
> (http://www.mail-archive.com/[email protected]/msg04999.html)
>
> For this purpose, I applied GridSearchCV to min_samples_split:
> "
> from sklearn.grid_search import GridSearchCV
> from sklearn.metrics import classification_report
>
> tuned_parameters = [{'min_samples_split': range(1, 21)}]
> clf_RF_gridsearched = GridSearchCV(RandomForestClassifier(),
>                                    tuned_parameters, cv=5, n_jobs=1)
> clf_RF_gridsearched = clf_RF_gridsearched.fit(X_train, y_train)
> y_true, y_pred = y_test, clf_RF_gridsearched.predict(X_test)
> print classification_report(y_true, y_pred)
> print metrics.confusion_matrix(y_true, y_pred)
> print clf_RF_gridsearched.best_estimator_
> "
> which outputs these statistics/settings:
> "
>               precision    recall  f1-score   support
>
>            0       0.74      0.94      0.83       173
>            1       0.67      0.26      0.38        76
>
> avg / total       0.72      0.73      0.69       249
>
> [[163  10]
>  [ 56  20]]
> RandomForestClassifier(bootstrap=True, compute_importances=False,
>              criterion=gini, max_depth=None, max_features=auto,
>              min_density=0.1, min_samples_leaf=1, min_samples_split=9,
>              n_estimators=10, n_jobs=1, oob_score=False,
>              random_state=<mtrand.RandomState object at 0x7f8cc411d2d0>,
>              verbose=0)
> "
>
> Not much of an improvement.
> Did I approach the problem in the wrong way?
> Or is the given dataset just a tough one?
>
>
> Cheers & Thanks,
> Paul
>


