Re: [Scikit-learn-general] Composite scores in grid_search.BaseSearchCV

Joel Nothman Mon, 11 Mar 2013 23:17:37 -0700

PS: notice how easy it would be to add
results['elapsed'] = time.time() - start_time
into fit_grid_point and voila! a 2d array of elapsed times...
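
To illustrate the PS, here is a rough sketch, not the code in the branch. `fit_grid_point_sketch` and the `fit_and_score` callable are hypothetical stand-ins for the real fit_grid_point and its fitting/scoring step; only the `results['elapsed']` line comes from the message above.

```python
import time


def fit_grid_point_sketch(fit_and_score):
    """Hypothetical stand-in for fit_grid_point that returns a dict.

    `fit_and_score` is an assumed callable that fits one estimator on one
    fold and returns its test score; it is not a scikit-learn function.
    """
    start_time = time.time()
    results = {'test_score': fit_and_score()}
    # The one-line addition suggested in the PS.
    results['elapsed'] = time.time() - start_time
    return results


# One result dict per (grid point, fold); here 2 grid points x 3 folds.
grid = [[fit_grid_point_sketch(lambda: 0.5) for _fold in range(3)]
        for _point in range(2)]

# Nested list of elapsed times, indexed by [grid_index][fold_no].
elapsed = [[r['elapsed'] for r in row] for row in grid]
```

Passing the nested list to numpy.array would give the 2d array of elapsed times.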


On Tue, Mar 12, 2013 at 5:05 PM, Joel Nothman
<[email protected]> wrote:

> An implementation, without backwards compatibility or new tests (but with
> tests.test_grid_search modified to pass) at
> https://github.com/jnothman/scikit-learn/tree/cv-enhanced-results
> (commit c9d45a3444: https://github.com/jnothman/scikit-learn/commit/c9d45a3444e6474901dca13eaeaef83b708bd969).
> I hope you like the general idea; I'm not sure you'll like the current
> attribute names...
>
> - Joel
>
>
> On Tue, Mar 12, 2013 at 2:15 PM, Joel Nothman <
> [email protected]> wrote:
>
>> On Mon, Mar 11, 2013 at 11:24 AM, Joel Nothman <
>> [email protected]> wrote:
>>
>>> On 03/10/2013 16:42:44 +0100, Andreas Mueller wrote:
>>>>
>>>>  As an aside: if you had all fitted estimators, it would also be quite
>>>> easy to compute the other scores, right?
>>>> Would that be an acceptable solution for you?
>>>>
>>>
>>> I guess so (noting that a modified scorer with the above could store the
>>> estimator as well)... Perhaps that's a reasonable option -- its main
>>> benefit over the above is less obfuscation -- though I do worry that
>>> storing all estimators in the general case is expensive.
>>>
>>
>> I should note that getting all fitted estimators back means I would still
>> need to reproduce all the test folds (let's hope there's no randomisation in
>> there or that's impossible) and perform prediction to get the stats I want.
>>
>> Just thinking a moment more about whether there is an elegant solution,
>> particularly considering the following cases:
>> * storing the estimator
>> * storing a summary of the estimator's model, e.g. sparsity produced by
>> regularisation
>> * storing multiple metrics (e.g. PRF or per-class metrics), with one to
>> be used as the primary score
>>
>> Looking at this generically, what we want to be stored in cv_scores_ (or
>> perhaps cv_results_; or grid_results_ and cv_results_ separately) is a
>> collection of named numpy arrays indexed by [grid_index] (like parameters
>> and aggregate scores currently, though they're now in lists, not numpy
>> arrays), and another set indexed by [grid_index, fold_no] (like
>> cross-validation scores).
>>
>> To facilitate this, let's assume that fit_grid_point returns a dict (or
>> equivalently an object with a __dict__) of results, such as training/testing
>> score/time and number of samples. In GridSearchCV we can easily create an
>> array of all folds for each key (assuming the set of keys is constant
>> across folds). We can also then produce the aggregate scores and select the
>> best model using numpy rather than ad hoc Python as in the current
>> implementation. [This change would not be very backwards-compatible without
>> some hackery to make the object returned from fit_grid_point also act like
>> a tuple.]
>>
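A minimal sketch of that aggregation, assuming fit_grid_point returned one dict per (grid point, fold). The key names `test_score` and `n_test_samples`, the sample-weighted mean, and all the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical per-fold results: one dict per (grid point, fold).
fold_results = [
    [{'test_score': 0.70, 'n_test_samples': 50},
     {'test_score': 0.80, 'n_test_samples': 50}],
    [{'test_score': 0.90, 'n_test_samples': 50},
     {'test_score': 0.85, 'n_test_samples': 50}],
]

# One named array per key, indexed by [grid_index, fold_no].
# This assumes the set of keys is constant across folds.
keys = fold_results[0][0].keys()
cv_results = {
    key: np.array([[fold[key] for fold in point] for point in fold_results])
    for key in keys
}

# Aggregate scores indexed by [grid_index]: sample-weighted mean over folds.
grid_scores = np.average(cv_results['test_score'], axis=1,
                         weights=cv_results['n_test_samples'])

# Model selection becomes a single argmax over the aggregate scores.
best_index = int(np.argmax(grid_scores))
```

Any further key in the dicts (training score, elapsed time) would get its own [grid_index, fold_no] array the same way.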
>> To facilitate composite scores, and perhaps storage of an estimator, we
>> modify the generic Scorer:
>>
>> class Scorer(object):
>>   def store(self, result_dict, estimator, X, y, prefix='', SCORE='score'):
>>     result_dict[prefix + SCORE] = self(estimator, X, y)
>>   ...
>>
>> Training score can be stored with scorer.store(result_dict, estimator, X,
>> y, prefix='train_').
>>
>> A PRFScorer variant would use store() to additionally set
>> result_dict[prefix + 'precision'] and result_dict[prefix + 'recall'],
>> though its __call__() would just return the F score.
>>
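One way such a PRFScorer could look, as a sketch only. It assumes binary labels with 1 as the positive class, computes the metrics by hand rather than calling sklearn.metrics, and uses a toy `AlwaysPositive` estimator that exists purely to exercise the code.

```python
class Scorer(object):
    """Minimal stand-in for the generic Scorer sketched above."""

    def __call__(self, estimator, X, y):
        raise NotImplementedError

    def store(self, result_dict, estimator, X, y, prefix='', SCORE='score'):
        result_dict[prefix + SCORE] = self(estimator, X, y)


class PRFScorer(Scorer):
    """Hypothetical binary precision/recall/F scorer; F is the primary score."""

    def _prf(self, estimator, X, y):
        y_pred = estimator.predict(X)
        pairs = list(zip(y, y_pred))
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t != 1 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p != 1)
        precision = tp / float(tp + fp) if tp + fp else 0.0
        recall = tp / float(tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f_score

    def __call__(self, estimator, X, y):
        # Calling the scorer still returns a single number: the F score.
        return self._prf(estimator, X, y)[2]

    def store(self, result_dict, estimator, X, y, prefix='', SCORE='score'):
        precision, recall, f_score = self._prf(estimator, X, y)
        result_dict[prefix + SCORE] = f_score
        result_dict[prefix + 'precision'] = precision
        result_dict[prefix + 'recall'] = recall


class AlwaysPositive(object):
    """Toy estimator, used only to exercise the sketch."""

    def predict(self, X):
        return [1 for _ in X]


results = {}
PRFScorer().store(results, AlwaysPositive(), [0, 1, 2, 3], [1, 1, 0, 0],
                  prefix='test_')
```

After the call, `results` holds `test_score`, `test_precision` and `test_recall`, so the extra metrics travel alongside the primary score.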
>> An arbitrary estimator summary (or storage of the whole estimator) could
>> hackily be stored in the result_dict by wrapping the Scorer.
>>
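The wrapping idea might be sketched like this. `SummaryStoringScorer`, `ConstantScorer`, `ToyModel` and the `sparsity` helper are all hypothetical names, and the `coef_` attribute on the toy model is assumed for the sake of the example.

```python
class ConstantScorer(object):
    """Toy scorer with the store() interface sketched above."""

    def __call__(self, estimator, X, y):
        return 1.0

    def store(self, result_dict, estimator, X, y, prefix='', SCORE='score'):
        result_dict[prefix + SCORE] = self(estimator, X, y)


class SummaryStoringScorer(object):
    """Hypothetical wrapper: delegates scoring, then stores a model summary."""

    def __init__(self, scorer, summarise, key='summary'):
        self.scorer = scorer
        self.summarise = summarise  # callable: estimator -> any value
        self.key = key

    def __call__(self, estimator, X, y):
        return self.scorer(estimator, X, y)

    def store(self, result_dict, estimator, X, y, prefix='', SCORE='score'):
        self.scorer.store(result_dict, estimator, X, y, prefix, SCORE)
        # The summary describes the fitted model, so it gets no prefix.
        result_dict[self.key] = self.summarise(estimator)


class ToyModel(object):
    """Toy fitted model with made-up coefficients."""
    coef_ = [0.0, 1.5, 0.0, -2.0]


def sparsity(estimator):
    # Fraction of coefficients that are exactly zero.
    coefs = estimator.coef_
    return sum(1 for c in coefs if c == 0) / float(len(coefs))


results = {}
scorer = SummaryStoringScorer(ConstantScorer(), sparsity, key='sparsity')
scorer.store(results, ToyModel(), X=None, y=None, prefix='test_')
```

Passing `lambda est: est` as `summarise` would store the whole estimator instead, at the memory cost already mentioned.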
>> Is that reasonably elegant? (Personally I think it's nicer than the
>> status quo.) I could perhaps throw together a quick patch...
>>
>> Best,
>>
>> - Joel
>>
>
>

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general