2011/10/1 mathieu lacage <[email protected]>:
> hi,
>
> I am looking for advice on how to pick a classifier among n competing
> classifiers when they are evaluated on more than a single training/test data
> set. i.e., I would like to compare, for each classifier, the set of roc
> curves that are generated from each training/test data set.  Is there an
> established way of doing this ?

There is no such high level tool in scikit-learn. You will probably
have to script it yourself.

Here are a couple of notes on related utilities in the scikit:

- sklearn.grid_search.GridSearchCV can be handy to find a good
parameter set for a given estimator instance, but it does not let you
compare different estimator instances with one another.
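A minimal sketch of such a per-estimator grid search (note: in recent
scikit-learn releases GridSearchCV lives in `sklearn.model_selection`
rather than `sklearn.grid_search`; the dataset here is a toy problem
for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy binary classification problem, for illustration only
X, y = make_classification(n_samples=200, random_state=0)

# tune the C parameter of a single SVC instance by cross-validation
grid = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```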

- sklearn.cross_validation.cross_val_score does accept a score_func
argument (such as `zero_one_score` or `f1_score` for classification),
but unfortunately it won't work with the `auc` function that computes
the area under the ROC curve, since the expected arguments are not the
same.
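For example, cross-validated F1 scores work out of the box (note:
recent releases take a `scoring` string instead of the `score_func`
callable, and the module moved to `sklearn.model_selection`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# toy binary classification problem, for illustration only
X, y = make_classification(n_samples=200, random_state=0)

# one F1 score per cross-validation fold
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=5, scoring="f1")
print(scores)
```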

You could however overcome those limitations by building your own
MetaClassifier class that inherits from BaseEstimator, takes a
classifier as a constructor parameter, implements the fit and predict
methods by delegating to it, and then implements the `score` method
using the classifier's `predict_proba` method and the
`sklearn.metrics.auc` function.
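A rough sketch of what that could look like (the class name is made up,
and `roc_curve` is used to produce the false/true positive rates that
`auc` expects; the inner classifier must expose `predict_proba`):

```python
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import auc, roc_curve


class AUCMetaClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical wrapper: delegates fit/predict to an inner binary
    classifier and scores by area under the ROC curve."""

    def __init__(self, classifier=None):
        self.classifier = classifier

    def fit(self, X, y):
        self.classifier.fit(X, y)
        return self

    def predict(self, X):
        return self.classifier.predict(X)

    def score(self, X, y):
        # probability estimates for the positive class
        proba = self.classifier.predict_proba(X)[:, 1]
        fpr, tpr, _ = roc_curve(y, proba)
        return auc(fpr, tpr)
```

Any classifier with `predict_proba` (e.g. LogisticRegression) can then
be wrapped and scored by AUC on a held-out set.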

With such a meta classifier and the GridSearchCV tool you would be
able to do model selection at the algorithm level, using the area
under the ROC curve as the selection criterion.
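In recent scikit-learn releases this kind of algorithm-level comparison
has become simpler, since `cross_val_score` accepts `scoring="roc_auc"`
directly; a sketch of comparing two candidate classifiers by their
cross-validated AUC (toy data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# toy binary classification problem, for illustration only
X, y = make_classification(n_samples=300, random_state=0)

# compare candidate classifiers by mean cross-validated AUC
for clf in (LogisticRegression(max_iter=1000), GaussianNB()):
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(clf.__class__.__name__, scores.mean())
```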

Watch out for the current inconsistencies in the predict_proba
implementations in the scikit, though (see the recent threads on the
topic).

For reference:

http://scikit-learn.sourceforge.net/stable/model_selection.html
http://scikit-learn.sourceforge.net/stable/modules/classes.html#metrics

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
