scikit-learn's general parameter searches currently require calling fit()
on their estimator for every parameter variation, even those where
re-fitting is unnecessary.
Andy has proposed a solution
<https://github.com/scikit-learn/scikit-learn/issues/1626> which involves
providing the estimator with a set of values for each varied parameter
(i.e. a grid), so that its predict function predicts a result for each
parameter setting. Unless its return value is a generator this may be
expensive in terms of memory, and coefficients for each setting may also
need to be returned; but mostly I think it's a bad idea because it sounds
difficult to implement as a simple extension of the current API. (I also
proposed a solution on that issue, but it has some big flaws...)
After Lars mentioned that GridSearch needs to call fit for every transform,
I lay awake in bed last night and came up with the following:
BaseEstimator adds a refit(X, y=None, **params) method which has two
explicit preconditions:
1. the data arguments (X, y; and sample_weight, etc., though I'm not sure
how those fit into the method signature) are identical to those of the most
recent call to fit().
2. params are exactly the changes since the last re/fit.
Both of these are implicit in the use of warm_start=True elsewhere, which
is one reason the latter should be deprecated in favour of a more explicit,
general API for minimal re-fitting.
The default implementation looks something like this:

    def refit(self, X, y=None, **params):
        self.set_params(**params)
        if (hasattr(self, '_refit_noop_params')
                and all(name in self._refit_noop_params for name in params)):
            return self
        return self.fit(X, y)
For example, on SelectKBest, _refit_noop_params would be ['k'] because fit
does not need to be called again when k is modified, though it does when
score_func is modified. Similarly, we need a refit_transform in
TransformerMixin.
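To make that concrete, here is a minimal toy sketch of the proposed pattern, using a SelectKBest-like selector; `refit` and `_refit_noop_params` are the proposed (hypothetical) API, not anything that exists in scikit-learn today:

```python
import numpy as np

class ToySelectKBest:
    # changing `k` alone never requires re-fitting; the per-feature
    # scores computed in fit() remain valid
    _refit_noop_params = ('k',)

    def __init__(self, score_func, k=10):
        self.score_func = score_func
        self.k = k

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        # the expensive part: compute a score per feature
        self.scores_ = self.score_func(X, y)
        return self

    def refit(self, X, y=None, **params):
        # the proposed default implementation, inlined
        self.set_params(**params)
        if all(name in self._refit_noop_params for name in params):
            return self  # scores_ are still valid; only `k` changed
        return self.fit(X, y)

    def transform(self, X):
        top = np.argsort(self.scores_)[-self.k:]
        return X[:, np.sort(top)]

X = np.random.RandomState(0).rand(20, 5)
sel = ToySelectKBest(score_func=lambda X, y: X.var(axis=0), k=3).fit(X)
sel.refit(X, k=2)  # no re-fit: only `k` changed
print(sel.transform(X).shape)  # (20, 2)
```

Passing score_func in refit instead would fall through to a full fit, since it is not in _refit_noop_params.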
We also implement BaseEstimator._plan_refits(self, param_iterator). This
has two return values. One is a reordering of param_iterator that attempts
to minimise the work done if fit were called with the first parameter
setting and refit with each subsequent setting in order. The second is an
expected cost for each parameter setting when executed in this order.
For example:
    SelectKBest._plan_refits(ParameterGrid({'score_func': [chi2, f_classif],
                                            'k': [10, 20]}))

might return:

    ([{'score_func': chi2, 'k': 10},
      {'score_func': chi2, 'k': 20},
      {'score_func': f_classif, 'k': 10},
      {'score_func': f_classif, 'k': 20}],
     array([1, 0, 1, 0]))
(array([0, 0, 1, 0]) would have the same effect as a cost and is what is
returned by the implementation below.)
GridSearch may then operate by first calling _plan_refits on its estimator
and dividing the work by folds and by cost-based partitions of the
reordered parameter space, with the parallelised function calling clone and
fit once, and refit many times.
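A single fold of such a search might then look like the following sketch; `_plan_refits` and `refit` are the proposed API assumed here, and a real search would also score each setting and use the costs to partition work across jobs:

```python
from copy import deepcopy

def search_one_fold(estimator, X, y, param_iterator):
    # Plan an order that maximises cheap refits; in a parallel search the
    # returned costs could also drive the partitioning of this sequence.
    reordered, costs = estimator._plan_refits(param_iterator)
    results = []
    current = None
    prev = {}
    for params in reordered:
        if current is None:
            # first setting in this partition: clone-and-fit from scratch
            current = deepcopy(estimator).set_params(**params).fit(X, y)
        else:
            # precondition 2: pass exactly the parameters that changed
            changed = {k: v for k, v in params.items()
                       if k not in prev or prev[k] != v}
            current.refit(X, y, **changed)
        prev = params
        results.append((params, current))  # a real search would score here
    return results
```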
A default implementation looks something like this:
    from collections import defaultdict
    import numpy as np

    def _plan_refits(self, param_iterator):
        try:
            noop_names = set(self._refit_noop_params)
        except AttributeError:
            # no declared no-op params: presumably fit is called every time
            param_iterator = list(param_iterator)
            return param_iterator, np.zeros(len(param_iterator))
        # bin parameter settings by their shared non-no-op params
        groups = defaultdict(list)
        for params in param_iterator:
            # sort the parameters into the two types
            op_params = []
            noop_params = []
            for k, v in params.items():
                (noop_params if k in noop_names else op_params).append((k, v))
            groups[tuple(sorted(op_params))].append(noop_params)
        # concatenate the bins and assign a nonzero cost at each transition
        groups = list(groups.items())
        reordered = [dict(op_params + noop_params)
                     for op_params, noop_seq in groups
                     for noop_params in noop_seq]
        costs = np.zeros(len(reordered))
        costs[np.cumsum([len(noop_seq)
                         for op_params, noop_seq in groups[:-1]],
                        dtype=int)] = 1
        return reordered, costs
While these generic implementations have some properties such as working
entirely on the basis of parameter names and not values, we can't assume
that in the general case. In particular, a Pipeline implementation where
steps can be set requires a somewhat more sophisticated implementation, and
non-binary costs. Pipeline.refit may refit only the tail end of the
pipeline, depending on the parameters it's passed.
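As a toy illustration of that last point, here is one way Pipeline.refit's tail-only behaviour could work; everything here (refit_tail, the per-step output cache, the uniform set_params) is a hypothetical sketch, not scikit-learn API:

```python
def refit_tail(steps, cache, X, y=None, **params):
    """Re-fit only the pipeline tail affected by changed parameters.

    steps  : list of (name, estimator) pairs
    cache  : each transformer's output from the previous fit (hypothetical)
    params : the changed parameters, in 'stepname__param' style (non-empty)
    """
    # find the earliest step whose parameters changed
    changed = {name.split('__', 1)[0] for name in params}
    first = min(i for i, (name, _) in enumerate(steps) if name in changed)
    # route each changed parameter to its step
    for name, est in steps:
        sub = {k.split('__', 1)[1]: v for k, v in params.items()
               if k.startswith(name + '__')}
        if sub:
            est.set_params(**sub)
    # unchanged head of the pipeline: reuse its cached output
    Xt = X if first == 0 else cache[first - 1]
    for i in range(first, len(steps)):
        name, est = steps[i]
        if i < len(steps) - 1:
            Xt = est.fit(Xt, y).transform(Xt)  # or refit_transform
            cache[i] = Xt
        else:
            est.fit(Xt, y)
    return steps
```

With a cost model, _plan_refits on a Pipeline could similarly report the depth of the earliest re-fitted step rather than a binary 0/1.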
Cheers,
- Joel
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general