On 05/20/2013 02:46 PM, Joel Nothman wrote:
I agree. My approach doesn't necessarily exclude this working if one of:
* sorting parameters in descending order is sufficient;
That is estimator dependent.
* we extend the role of _plan_refits to being one of preparation, so
the estimator may set some state like a search range (which would need
to be copied in clone());
* we extend _plan_refits to allow it to return the parameter
settings modified (though this may make implementing the Pipeline
version harder); or
* we extend _plan_refits to allow it to return some additional
information to be passed to fit/refit (this will definitely make
implementing the Pipeline version harder).
Basically your proposal addresses cases where one doesn't need to
touch parts of the pipeline at all.
It wouldn't help us get rid of any of the CV objects, though.
It would also help get rid of anything that may warm start from a
previous solution...
Is there something interesting about StandardScaler, or have you
thrown it in for fun? Or as an example where transform is more
expensive than fit?
Just for fun ;) Basically I thought that was one that you don't
really need to refit at all (for a given fold) as you usually
don't search over any parameters.
Not refitting at all is easy. Not transforming at all is left till later.
So, let's take something like your proposal, but instead of having
lists of values for each parameter (which assumes a grid), we have
lists of parameter settings. So we have a method on each estimator
such as:
    def iter_fits(self, param_iter, X, y=None):
        """Generate models for each of the given parameter settings"""
A default implementation would be an expansion of:
        param_iter, costs = self._plan_refits(param_iter)
        for params in param_iter:
            yield params, self.refit(X, y, params)
(It similarly needs a fit_transform variant.)
Note that the generator yields the parameters (or it could just be the
index into the parameters) as well as the model, so that they may be
reordered; and it would generally yield self as the second element.
By yielding from the generator, we have full access to the model and
its predicting functions.
I like the look of this better, though it means there's no option for
cleverness about multiprocessing. And the recursive execution of a
Pipeline would be somewhat neater and not require memoizing for transform.
What do you mean by "cleverness about multiprocessing"?
Somewhere a decision has to be made which computations should be
parallelized and which should be serial.
The splitting into folds should be in GridSearchCV. So I don't entirely
see how this would work.
Basically GridSearchCV would need to query the estimator to know which
parameters should be searched over serially and in which order,
so it can do the dispatching.
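One way to picture that dispatching is sketched below, under assumptions not stated in the thread: the estimator reports a key (here the made-up 'gamma') whose settings must be fit serially as one warm-start chain, and the search object parallelizes across chains. The helper name and grouping key are purely illustrative.

```python
# Sketch of the dispatch question above: group parameter settings into
# chains that must run serially (e.g. one warm-start chain per 'gamma'
# value), so a search object can dispatch each chain to a worker and
# run the fits within a chain in order. Names are illustrative only.
from itertools import groupby

def serial_chains(settings, chain_key):
    """Group parameter settings into serially-dependent chains."""
    ordered = sorted(settings, key=lambda p: p[chain_key])
    return [list(group)
            for _, group in groupby(ordered, key=lambda p: p[chain_key])]

settings = [{"gamma": 0.1, "C": 1}, {"gamma": 0.1, "C": 10},
            {"gamma": 1.0, "C": 1}, {"gamma": 1.0, "C": 10}]
chains = serial_chains(settings, "gamma")
# Each chain can go to a separate process; within a chain, later fits
# can reuse the earlier solution.
```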
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general