On Fri, Nov 28, 2014 at 5:14 PM, Aaron Staple <aaron.sta...@gmail.com>
wrote:
> [...]
> However, I tried to run a couple of test cases with 0-1 predictions for
> RidgeCV and classification with RidgeClassifierCV, and I got some error
> messages. It looks like one reason for this is that
> LinearModel._center_data can convert the y values to non integers. In
> addition, it appears that in the case of multiclass classification the
> scorer is applied to the ravel()’ed list of one-vs-all classifiers and not
> to the actual class predictions. Am I right in thinking that this can
> affect the classification score for some scorers? For example, consider a
> simple accuracy scorer and just one prediction. It is possible for some
> one-vs-all classifiers to be predicted correctly while the overall class
> prediction is wrong - thus the accuracy score over the one-vs-all
> classifiers would be nonzero while the overall classification accuracy is
> zero. (In addition, if I am reading correctly I believe the y_true and
> y_predicted values are possibly being passed incorrectly to the scorer
> currently, and are being swapped with each other.)
>
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/ridge.py#L800
Shouldn't this line use the unnormalized y? Otherwise, this is evaluating a
different problem.
BTW, the scorer handling in RidgeCV is currently broken.
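To make the ravel() concern quoted above concrete, here is a minimal sketch (the arrays are illustrative, not scikit-learn internals): with one sample and three classes, accuracy over the ravel()'ed one-vs-all outputs can be nonzero even though the actual multiclass prediction is wrong.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# One sample, three classes. True class is 0, one-hot ("one-vs-all") encoded.
y_true_ova = np.array([[1, 0, 0]])
# Decision values from three hypothetical one-vs-all classifiers.
scores_ova = np.array([[-0.1, 0.2, -0.3]])
y_pred_ova = (scores_ova > 0).astype(int)  # per-classifier binary output

# Accuracy over the ravel()'ed one-vs-all outputs: one of the three binary
# sub-problems agrees with the truth, so the score is nonzero...
ravel_acc = accuracy_score(y_true_ova.ravel(), y_pred_ova.ravel())

# ...but the multiclass prediction (argmax of the decision values) picks
# class 1 while the true class is 0, so the real class accuracy is zero.
class_acc = accuracy_score(y_true_ova.argmax(axis=1),
                           scores_ova.argmax(axis=1))

print(ravel_acc, class_acc)  # 1/3 vs. 0.0
```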
>
> Given these observations I wanted to double check 1) that we want to
> support classification scorers and not just regression scorers at this
> precise location in this code and 2) that I should start using get_score in
> this location now, given that I believe at least some additional work will
> be needed for support of classification scorers.
>
I was more talking about ranking scorers.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
# y contains binary values
y_pred = RandomForestRegressor().fit(X, y).predict(X)
print(roc_auc_score(y, y_pred))
# y contains ordinal values
y_pred = RandomForestRegressor().fit(X, y).predict(X)
print(ndcg_score(y, y_pred))  # not yet in scikit-learn
For me these two use cases are perfectly legitimate. Now, I would really
like to use GridSearchCV to tune the RF hyper-parameters against AUC or
NDCG but the scorer API insists on calling either predict_proba or
decision_function.
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/scorer.py#L159
If we could detect that an estimator is a regressor, we could call
"predict" instead but we have currently no way to know that. We can't check
isinstance(estimator, RegressorMixin) since we can't even expect a
third-party regression class to inherit RegressorMixin (as per our current
API "specification").
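As a stopgap, a plain callable sidesteps the predict_proba/decision_function dispatch entirely, since GridSearchCV accepts any callable with the (estimator, X, y) signature. A sketch (the scorer name is made up, and import paths follow current scikit-learn):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

def auc_via_predict(estimator, X, y):
    # Rank by the regressor's raw predictions instead of going through
    # the scorer API's decision_function / predict_proba lookup.
    return roc_auc_score(y, estimator.predict(X))

# Toy data: binary labels that a regressor can rank reasonably well.
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    {"n_estimators": [5, 10]},
                    scoring=auc_via_predict, cv=3)
grid.fit(X, y)
print(grid.best_score_)
```

This only works around the dispatch for a single ad hoc scorer; it does not fix the underlying problem that the scorer API has no way to know the estimator is a regressor.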
M.
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general