On 05/20/2013 02:46 PM, Joel Nothman wrote:

I agree. My approach doesn't necessarily exclude this working if one of:
* sorting parameters in descending order is sufficient (that is estimator dependent);
* we extend the role of _plan_refits to being one of preparation, so the estimator may set some state such as a search range (which would need to be copied in clone());
* we extend _plan_refits to allow it to return the parameter settings modified (though this may make implementing the Pipeline version harder); or
* we extend _plan_refits to allow it to return some additional information to be passed to fit/refit (this will definitely make implementing the Pipeline version harder).
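The first option could be sketched roughly as follows. This is a toy, hypothetical helper, not scikit-learn API; the parameter name 'alpha' and the cost estimates are assumptions for illustration only:

```python
# Hypothetical sketch: order parameter settings so that a regularization
# parameter (assumed here to be named 'alpha') decreases, letting each fit
# warm-start from the previous, more strongly regularized solution.

def plan_refits_descending(param_iter, key='alpha'):
    """Return settings sorted by decreasing `key`, with rough refit costs."""
    params = sorted(param_iter, key=lambda p: p[key], reverse=True)
    # The first fit is a cold start; later fits are assumed cheaper
    # because they can warm-start.  The cost values are made up.
    costs = [1.0] + [0.3] * (len(params) - 1)
    return params, costs

ordered, costs = plan_refits_descending(
    [{'alpha': 0.1}, {'alpha': 10.0}, {'alpha': 1.0}])
print(ordered)  # [{'alpha': 10.0}, {'alpha': 1.0}, {'alpha': 0.1}]
```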

    Basically your proposal addresses cases where one doesn't need to
    touch parts of the pipeline at all.
    It wouldn't help us get rid of any of the CV objects, though.


It also helps get rid of anything that may warm-start from a previous solution...

    Is there something interesting about StandardScaler, or have you
    thrown it in for fun? or for an example where transform is more
    expensive than fit?

    Just for fun ;) Basically I thought that was one that you don't
    really need to refit at all (for a given fold) as you usually
    don't search over any parameters.


Not refitting at all is easy. Not transforming at all is left till later.


So, let's take something like your proposal, but instead of having lists of values for each parameter (which assumes a grid), we have lists of parameter settings. So we have a method on each estimator such as:

def iter_fits(self, param_iter, X, y=None):
    """Generate models for each of the given parameter settings
    """

A default implementation would be an expansion of:

    param_iter, costs = self._plan_refits(param_iter)
    for params in param_iter:
        yield params, self.refit(X, y, params)

(It similarly needs a fit_transform variant.)

Note that the generator yields the parameter settings (or it could just yield the index into the parameter list) alongside the model, so that results can be matched up even if they are reordered; the model it yields will generally be self. Because models are yielded one at a time, the caller has full access to each model and its prediction methods before the next refit.
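As a sketch of how this protocol might look end to end, here is a toy estimator. This is hypothetical code, not scikit-learn: `refit` and `_plan_refits` are the proposed methods under discussion, and the "model" is just scaled column sums so the example runs standalone:

```python
class ToyEstimator:
    """Toy estimator illustrating the proposed iter_fits protocol."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y=None):
        # Stand-in "model": column sums scaled by alpha.
        self.coef_ = [self.alpha * sum(col) for col in zip(*X)]
        return self

    def refit(self, X, y=None, params=None):
        # Default refit: set the new parameters and fit from scratch.
        # A smarter estimator could warm-start from the previous coef_.
        if params:
            self.set_params(**params)
        return self.fit(X, y)

    def _plan_refits(self, param_iter):
        # Default plan: keep the requested order, unit cost per refit.
        param_list = list(param_iter)
        return param_list, [1] * len(param_list)

    def iter_fits(self, param_iter, X, y=None):
        """Generate models for each of the given parameter settings."""
        param_iter, costs = self._plan_refits(param_iter)
        for params in param_iter:
            yield params, self.refit(X, y, params)


# Because models are yielded lazily, the caller can score each one
# before the next refit overwrites its state.
X = [[1, 2], [3, 4]]
for params, model in ToyEstimator().iter_fits(
        [{'alpha': 1.0}, {'alpha': 2.0}], X):
    print(params, model.coef_)
```

Note that the default `refit` simply calls `fit`; the point of the protocol is that an estimator can override `_plan_refits` and `refit` to reorder settings and reuse state between them.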

I like the look of this better, though it means there's no option for cleverness about multiprocessing. And the recursive execution of a Pipeline would be somewhat neater and not require memoizing for transform.

What do you mean by "cleverness about multiprocessing"?
Somewhere a decision has to be made about which computations should be parallelized and which should run serially. The splitting into folds should happen in GridSearchCV, so I don't entirely see how this would work.

Basically GridSearchCV would need to query the estimator to know which parameters should be searched over serially and in which order,
so it can do the dispatching.
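One way that dispatching query could look, as a hedged sketch (the function and the 'alpha' serial-parameter name are hypothetical, not an existing API): group settings so that only the serially-searched parameter varies within a group; each group must be fitted in order to exploit warm starts, while distinct groups are independent and could go to separate workers.

```python
# Hypothetical sketch of the dispatching idea discussed above.

def plan_dispatch(param_grid, serial_key='alpha'):
    """Split settings into groups that must be fitted serially."""
    groups = {}
    for params in param_grid:
        # Settings that agree on everything but `serial_key` share a group.
        key = tuple(sorted(
            (k, v) for k, v in params.items() if k != serial_key))
        groups.setdefault(key, []).append(params)
    for group in groups.values():
        # Within a group, search the serial parameter in descending order.
        group.sort(key=lambda p: p[serial_key], reverse=True)
    return list(groups.values())

grid = [{'C': 1, 'alpha': 0.1}, {'C': 1, 'alpha': 1.0},
        {'C': 2, 'alpha': 0.1}]
print(plan_dispatch(grid))
```

Here the two C values give two independent groups, so GridSearchCV could dispatch them in parallel while keeping the alpha search serial within each.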
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general