An implementation, without backwards compatibility or new tests (but with
tests.test_grid_search modified to pass), is at
https://github.com/jnothman/scikit-learn/tree/cv-enhanced-results
(commit c9d45a3444:
https://github.com/jnothman/scikit-learn/commit/c9d45a3444e6474901dca13eaeaef83b708bd969).
I hope you like the general idea; I'm not sure you'll like the current
attribute names...
- Joel
On Tue, Mar 12, 2013 at 2:15 PM, Joel Nothman
<[email protected]>wrote:
> On Mon, Mar 11, 2013 at 11:24 AM, Joel Nothman <
> [email protected]> wrote:
>
>> On 03/10/2013 16:42:44 +0100, Andreas Mueller wrote:
>>>
>>> As an aside: if you had all fitted estimators, it would also be quite
>>> easy to compute the other scores, right?
>>> Would that be an acceptable solution for you?
>>>
>>
>> I guess so (noting that a modified scorer with the above could store the
>> estimator as well)... Perhaps that's a reasonable option -- its main
>> benefit over the above is less obfuscation -- though I do worry that
>> storing all estimators in the general case is expensive.
>>
>
> I should note that getting all fitted estimators back means I would still
> need to reproduce all the test folds (let's hope there's no randomisation
> in there, or that's impossible) and perform prediction to get the stats I
> want.
>
> Just thinking a moment more about whether there is an elegant solution,
> particularly considering the following cases:
> * storing the estimator
> * storing a summary of the estimator's model, e.g. sparsity produced by
> regularisation
> * storing multiple metrics (e.g. PRF or per-class metrics), with one to be
> used as the primary score
>
> Looking at this generically, what we want stored in cv_scores_ (or
> perhaps cv_results_; or grid_results_ and cv_results_ separately) is a
> collection of named numpy arrays indexed by [grid_index] (like parameters
> and aggregate scores now, though those are stored in lists, not numpy
> arrays), and another set indexed by [grid_index, fold_no] (like
> cross-validation scores).
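>
> Concretely, the layout might look something like this (a sketch only; the
> names and keys here are illustrative, not a proposed API):
>
> import numpy as np
>
> n_points, n_folds = 6, 3  # e.g. 6 parameter settings, 3 CV folds
>
> # indexed by [grid_index]: one entry per parameter setting
> grid_results = {
>     'parameters': np.empty(n_points, dtype=object),
>     'mean_test_score': np.zeros(n_points),
> }
>
> # indexed by [grid_index, fold_no]: one entry per fold
> cv_results = {
>     'test_score': np.zeros((n_points, n_folds)),
>     'test_time': np.zeros((n_points, n_folds)),
> }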
>
> To facilitate this, let's assume that fit_grid_point returns a dict (or,
> equivalently, an object with a __dict__) containing entries such as
> training/testing score/time and the number of samples. In GridSearchCV we
> can then easily create an
> array of all folds for each key (assuming the set of keys is constant
> across folds). We can also then produce the aggregate scores and select the
> best model using numpy rather than ad hoc Python as in the current
> implementation. [This change would not be very backwards-compatible without
> some hackery to make the object returned from fit_grid_point also act like
> a tuple.]
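>
> As a rough sketch of that (stack_results and the key names here are
> hypothetical, just to show the shape of the idea):
>
> import numpy as np
>
> def stack_results(fold_results, n_points, n_folds):
>     # fold_results: a flat list of dicts from fit_grid_point, assumed
>     # ordered by (grid_index, fold_no) and sharing the same keys
>     return {key: np.array([d[key] for d in fold_results]
>                           ).reshape(n_points, n_folds)
>             for key in fold_results[0]}
>
> # aggregation and best-model selection then become plain numpy:
> # arrays = stack_results(fold_results, n_points, n_folds)
> # mean_scores = arrays['test_score'].mean(axis=1)
> # best_index = mean_scores.argmax()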
>
> To facilitate composite scores, and perhaps storage of an estimator, we
> modify the generic Scorer:
>
> class Scorer(object):
>     def store(self, result_dict, estimator, X, y, prefix='', SCORE='score'):
>         result_dict[prefix + SCORE] = self(estimator, X, y)
>     ...
>
> Training score can be stored with scorer.store(result_dict, estimator, X,
> y, prefix='train_').
>
> A PRFScorer variant would override store() to additionally set
> result_dict[prefix + 'precision'] and result_dict[prefix + 'recall'],
> though its __call__() would still return just the F score.
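>
> Something like this, say (a sketch only; it builds on the Scorer above
> and assumes sklearn.metrics.precision_recall_fscore_support with a
> weighted average, which may not be the averaging we'd settle on):
>
> from sklearn.metrics import precision_recall_fscore_support
>
> class PRFScorer(Scorer):
>     def store(self, result_dict, estimator, X, y, prefix='', SCORE='score'):
>         p, r, f, _ = precision_recall_fscore_support(
>             y, estimator.predict(X), average='weighted')
>         result_dict[prefix + 'precision'] = p
>         result_dict[prefix + 'recall'] = r
>         result_dict[prefix + SCORE] = f
>
>     def __call__(self, estimator, X, y):
>         # model selection still sees a single number: the F score
>         _, _, f, _ = precision_recall_fscore_support(
>             y, estimator.predict(X), average='weighted')
>         return f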
>
> An arbitrary estimator summary (or storage of the whole estimator) could
> hackily be stored in the result_dict by wrapping the Scorer.
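>
> For example (again just a hypothetical sketch):
>
> class StoringScorer(object):
>     # wrapper: delegate scoring, but also keep the fitted estimator
>     def __init__(self, scorer):
>         self.scorer = scorer
>
>     def store(self, result_dict, estimator, X, y, prefix='', SCORE='score'):
>         self.scorer.store(result_dict, estimator, X, y, prefix, SCORE)
>         result_dict[prefix + 'estimator'] = estimator
>
>     def __call__(self, estimator, X, y):
>         return self.scorer(estimator, X, y)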
>
> Is that reasonably elegant? (Personally I think it's nicer than the status
> quo.) I could perhaps throw together a quick patch...
>
> Best,
>
> - Joel
>