Dear Andreas,
Dear Gilles,
Dear SciKitters,
> Hi Paul
> Tuning min_samples_split is a good idea but not related to imbalanced
> classes.
> First, you should specify what you want to optimize. Accuracy is usually
> not a good measure for imbalanced classes. Maybe F-score?
How would one do that?
I just tried this:
"
scores = [
    ('f1-score', f1_score),
    ('recall', recall_score),
]
for score_name, score_func in scores:
    print "# Tuning hyper-parameters for %s" % score_name
    print
    # cv and n_jobs are GridSearchCV arguments, not fit() arguments
    clf_RF_gridsearched = GridSearchCV(RandomForestClassifier(),
                                       tuned_parameters,
                                       score_func=score_func,
                                       cv=5, n_jobs=20)
    clf_RF_gridsearched = clf_RF_gridsearched.fit(X_train, y_train)
    print "Best parameters set found on development set:"
    print
    print clf_RF_gridsearched.best_estimator_
    print
    print "Grid scores on development set:"
    print
    # renamed inner loop variable so it does not shadow the outer `scores` list
    for params, mean_score, cv_scores in clf_RF_gridsearched.grid_scores_:
        print "%0.3f (+/-%0.03f) for %r" % (mean_score, cv_scores.std() / 2,
                                            params)
    print
    print
    print "Detailed classification report:"
    print
    print "The model is trained on the full development set."
    print "The scores are computed on the full evaluation set."
    print
    y_true, y_pred = y_test, clf_RF_gridsearched.predict(X_test)
    print classification_report(y_true, y_pred)
    print metrics.confusion_matrix(y_true, y_pred)
    print
"
However, f1_score is not found. I would have expected it to work analogously
to recall_score.
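For reference, a minimal sanity check with made-up toy labels, assuming f1_score is exported from sklearn.metrics right next to recall_score:

```python
# Toy labels, purely for illustration; f1_score and recall_score are
# assumed to live side by side in sklearn.metrics.
from sklearn.metrics import f1_score, recall_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(f1_score(y_true, y_pred))      # 2 TP, 0 FP, 1 FN -> F1 = 0.8
print(recall_score(y_true, y_pred))  # 2 of 3 positives found
```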
>
> If you want your confusion matrix to be more balanced, you can try two
> things (as class weights are not implemented yet afaik):
> - set a different decision threshold: classify all as positive that have
> a probability of being positive of over .20 (for example).
> - stratify the dataset, meaning make it such that there is the same
> number of samples from both classes.
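The first suggestion needs no extra scikit-learn machinery: one can threshold the positive-class probabilities directly. A rough sketch, where the toy array below stands in for clf.predict_proba(X_test)[:, 1]:

```python
import numpy as np

# Made-up positive-class probabilities, standing in for
# clf.predict_proba(X_test)[:, 1]
proba_pos = np.array([0.05, 0.15, 0.25, 0.60, 0.90])

y_pred_default = (proba_pos > 0.5).astype(int)   # usual 0.5 cut-off
y_pred_lowered = (proba_pos > 0.20).astype(int)  # lowered threshold: more
                                                 # samples classified positive

print(y_pred_default)  # [0 0 0 1 1]
print(y_pred_lowered)  # [0 0 1 1 1]
```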
Indeed, undersampling gives better performance for the underrepresented
class. However, I have been doing this in a pre-processing step outside of
sklearn.
=> How can I do this within sklearn?
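For what it's worth, as far as I know there is no built-in undersampling helper, so plain numpy indexing may be the simplest route; a rough sketch with a made-up imbalanced label vector standing in for y_train:

```python
import numpy as np

# Made-up imbalanced data (6 negatives, 2 positives), standing in for
# X_train / y_train
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
X = np.arange(len(y)).reshape(-1, 1)

rng = np.random.RandomState(0)
pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]

# Keep every minority sample; randomly subsample the majority class
# down to the same size
neg_sub = rng.permutation(neg_idx)[:len(pos_idx)]
keep = np.concatenate([pos_idx, neg_sub])

X_bal, y_bal = X[keep], y[keep]
print(sorted(y_bal))  # [0, 0, 1, 1] -- balanced
```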
Cheers,
Paul
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general