On Mon, Mar 11, 2013 at 11:24 AM, Joel Nothman <[email protected]> wrote:
> On 03/10/2013 16:42:44 +0100, Andreas Mueller wrote:
>>
>> As an aside: if you had all fitted estimators, it would also be quite
>> easy to compute the other scores, right?
>> Would that be an acceptable solution for you?
>>
>
> I guess so (noting that a modified scorer with the above could store the
> estimator as well)... Perhaps that's a reasonable option -- its main
> benefit over the above is less obfuscation -- though I do worry that
> storing all estimators in the general case is expensive.
>
I should note that getting all fitted estimators back means I would still
need to reproduce all the test folds (let's hope there's no randomisation in
there, or that's impossible) and perform prediction to get the stats I want.
Just thinking a moment more about whether there is an elegant solution,
particularly considering the following cases:
* storing the estimator
* storing a summary of the estimator's model, e.g. sparsity produced by
regularisation
* storing multiple metrics (e.g. PRF -- precision/recall/F-score -- or
per-class metrics), with one to be used as the primary score
Looking at this generically, what we want stored in cv_scores_ (or
perhaps cv_results_; or grid_results_ and cv_results_ separately) is a
collection of named numpy arrays indexed by [grid_index] (like the
parameters and aggregate scores currently, though those are stored in
lists, not numpy arrays), and another set indexed by [grid_index, fold_no]
(like the per-fold cross-validation scores).
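To make that concrete, here is a rough sketch of the two collections (all
names and shapes here are hypothetical, just to show the layout I mean):

    import numpy as np

    n_grid_points, n_folds = 12, 5  # e.g. 12 parameter settings, 5 CV folds

    # one value per parameter setting, indexed by [grid_index]
    grid_results = {
        'parameters': np.empty(n_grid_points, dtype=object),
        'mean_test_score': np.zeros(n_grid_points),
    }

    # one value per setting and fold, indexed by [grid_index, fold_no]
    fold_results = {
        'test_score': np.zeros((n_grid_points, n_folds)),
        'test_time': np.zeros((n_grid_points, n_folds)),
    }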
To facilitate this, let's assume that fit_grid_point returns a dict (or
equivalently an object with a __dict__) containing entries such as
training/testing score and time, and the number of test samples. In
GridSearchCV we can then easily create an array over all folds for each
key (assuming the set of keys is constant across folds). We can also
produce the aggregate scores and select the best model using numpy rather
than ad hoc Python as in the current implementation; see the sketch below.
[This change would not be very backwards-compatible without some hackery
to make the object returned from fit_grid_point also act like a tuple.]
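Roughly like this (the dicts and their values are made up for
illustration; fold_dicts[i][j] stands in for what fit_grid_point would
return for the i-th parameter setting and j-th fold):

    import numpy as np

    # stand-in for the dicts returned by fit_grid_point
    fold_dicts = [
        [{'test_score': 0.8, 'test_time': 0.01},
         {'test_score': 0.9, 'test_time': 0.02}],
        [{'test_score': 0.7, 'test_time': 0.01},
         {'test_score': 0.6, 'test_time': 0.03}],
    ]

    keys = fold_dicts[0][0]  # assume keys are constant across folds
    fold_results = dict(
        (key, np.array([[d[key] for d in row] for row in fold_dicts]))
        for key in keys)  # each value has shape (n_grid_points, n_folds)

    mean_test_scores = fold_results['test_score'].mean(axis=1)
    best_index = mean_test_scores.argmax()  # numpy, not ad hoc Python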
To facilitate composite scores, and perhaps storage of an estimator, we
modify the generic Scorer:
class Scorer(object):
    def store(self, result_dict, estimator, X, y, prefix='', SCORE='score'):
        # compute the score via __call__ and record it under a named key
        result_dict[prefix + SCORE] = self(estimator, X, y)
    ...
Training score can be stored with scorer.store(result_dict, estimator, X,
y, prefix='train_').
A PRFScorer variant would override store() to additionally set
result_dict[prefix + 'precision'] and result_dict[prefix + 'recall'],
though its __call__() would still just return the F score.
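Something like this rough sketch, assuming the store() convention above
(the weighted averaging is just one possible choice):

    from sklearn.metrics import precision_recall_fscore_support

    class PRFScorer(Scorer):  # Scorer as sketched above
        def store(self, result_dict, estimator, X, y, prefix='',
                  SCORE='score'):
            p, r, f, _ = precision_recall_fscore_support(
                y, estimator.predict(X), average='weighted')
            result_dict[prefix + 'precision'] = p
            result_dict[prefix + 'recall'] = r
            result_dict[prefix + SCORE] = f  # F doubles as the main score

        def __call__(self, estimator, X, y):
            _, _, f, _ = precision_recall_fscore_support(
                y, estimator.predict(X), average='weighted')
            return f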
An arbitrary estimator summary (or storage of the whole estimator) could
hackily be stored in the result_dict by wrapping the Scorer.
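For example, a hypothetical wrapper that records coefficient sparsity
alongside whatever the wrapped scorer stores (name and summary are
illustrative only):

    import numpy as np

    class SummaryStoringScorer(object):
        def __init__(self, scorer):
            self.scorer = scorer  # any object with the store()/__call__ API

        def store(self, result_dict, estimator, X, y, prefix='',
                  SCORE='score'):
            self.scorer.store(result_dict, estimator, X, y, prefix, SCORE)
            if hasattr(estimator, 'coef_'):
                # e.g. fraction of zero coefficients under L1 regularisation
                result_dict[prefix + 'sparsity'] = np.mean(
                    estimator.coef_ == 0)

        def __call__(self, estimator, X, y):
            return self.scorer(estimator, X, y)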
Is that reasonably elegant? (Personally I think it's nicer than the status
quo.) I could perhaps throw together a quick patch...
Best,
- Joel