Hi.
Can you give a bit more detail on options 3 and 4?
And can you give an example use case?
When do you need both scorers and out-of-bag samples? Scorers are used in GridSearchCV and cross_val_score, but out-of-bag samples basically replace cross-validation,
so I don't quite understand how the two would work together.

I think it would be great if you could give a use-case and some (pseudo) code on how it would look with your favourite solution.

Cheers,
Andy

On 10/26/2014 10:33 PM, Aaron Staple wrote:
Greetings sklearn developers,

I’m a new sklearn contributor, and I’ve been working on a small project to allow customization of the scoring metric used when scoring out of bag data for random forests (see https://github.com/scikit-learn/scikit-learn/pull/3723). In this PR, @mblondel and I have been discussing an architectural issue that we would like others to weigh in on.

While working on my implementation, I’ve run into a bit of difficulty using the scorer implementation as it exists today - in particular, with the interface expressed in _BaseScorer. The current _BaseScorer interface is callable, accepting an estimator (used as a Predictor) along with data points X and the true targets y, and returning a score. The various _BaseScorer implementations compute a score by calling estimator.predict(X), estimator.predict_proba(X), or estimator.decision_function(X) as needed, possibly applying some transformations to the results, and then applying a score function.
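
For reference, the current calling convention looks roughly like this (toy data and estimator, just to show the shape of the API):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, make_scorer

    X, y = make_classification(random_state=0)
    clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

    # A scorer is a callable taking (estimator, X, y); the predictions
    # (predict / predict_proba / decision_function) are computed inside it.
    scorer = make_scorer(f1_score)
    print(scorer(clf, X, y))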

The issue I’ve run into is that predicting out of bag samples is a rather specialized procedure, because the model used differs for each training point depending on how that point was used during fitting. Computing these predictions doesn’t map naturally onto the Predictor interface. In addition, in the PR we’ve been discussing the idea that a random forest estimator will make its out of bag predictions available as attributes, allowing a user of the estimator to score these provided predictions afterwards. Also, @mblondel mentioned that for his work on multiple-metric grid search, he is interested in scoring predictions he computes outside of a Predictor.

The difficulty is that the current scorers take an estimator and data points, and compute predictions internally. They don’t accept externally computed predictions.
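
To make the mismatch concrete, here is a small example using the oob_decision_function_ attribute that RandomForestClassifier already exposes (toy data again):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(random_state=0)

    # With oob_score=True the forest already exposes out of bag predictions:
    clf = RandomForestClassifier(n_estimators=50, oob_score=True,
                                 random_state=0).fit(X, y)
    oob_proba = clf.oob_decision_function_  # shape (n_samples, n_classes)

    # But a scorer can only be called as scorer(estimator, X, y) and will
    # re-predict internally; there is no supported way to hand it oob_proba.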

I’ve written up a series of different generalized options for implementing a system of scoring externally computed predictions (some are likely undesirable but are provided as points of comparison):

1) Add a new implementation that’s completely separate from the existing _BaseScorer class.

2) Use the existing _BaseScorer without changes. This means abusing the Predictor interface by creating something like a dummy predictor that ignores X and simply returns predictions that were computed externally for a known X (see the sketch after this list).

3) Add a private api to _BaseScorer for scoring externally computed predictions. The private api can be called by a public helper function in scorer.py.

4) Change the public api of _BaseScorer to make scoring of externally computed predictions a public operation along with the existing functionality. Also possibly rename _BaseScorer => BaseScorer.

5) Change the public api of _BaseScorer so that it only handles externally computed predictions. The existing functionality would be implemented by the caller (as a callback, since the required type of prediction data is not known by the caller).
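
To illustrate option 2, a dummy predictor might look something like this (the class name is made up, purely for illustration):

    class PrecomputedPredictor:
        """Dummy 'predictor' that ignores X and returns predictions that
        were computed elsewhere (e.g. out of bag). Illustrative only."""

        def __init__(self, y_pred):
            self.y_pred = y_pred

        def fit(self, X, y=None):
            return self

        def predict(self, X):
            # X is ignored; y_pred was computed externally for a known X.
            return self.y_pred

    # An existing scorer could then be fed precomputed predictions:
    #     scorer(PrecomputedPredictor(oob_pred), X, y)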

So far in the PR we’ve been looking at options 2, 3, and 4, with 4 seeming like a good candidate. Once we decide on one of these options, I’d like to follow up with stakeholders on the specifics of what the new interface will look like.
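
To give a feel for option 4, here is a rough sketch; the method name score_predictions and the attribute names are only illustrative, not a final proposal:

    class BaseScorer:
        """Sketch only: keep the current callable behaviour and add a
        public method for scoring externally computed predictions."""

        def __init__(self, score_func, sign=1, **kwargs):
            self._score_func = score_func
            self._sign = sign
            self._kwargs = kwargs

        def __call__(self, estimator, X, y):
            # Existing behaviour: compute predictions internally, then score.
            y_pred = estimator.predict(X)
            return self.score_predictions(y, y_pred)

        def score_predictions(self, y_true, y_pred):
            # New public entry point: score predictions computed elsewhere,
            # e.g. the out of bag predictions exposed by a random forest.
            return self._sign * self._score_func(y_true, y_pred, **self._kwargs)

GridSearchCV and cross_val_score would keep calling the scorer as they do now, while out of bag scoring (and multiple-metric grid search) could call score_predictions on precomputed predictions directly.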

Thanks,
Aaron Staple

