TL;DR: the choice of data structure for parameter search results should
anticipate new use-cases

Thanks Joel for the detailed analysis.

In the current situation, I think I myself like:

5. many attributes, each an array, on a custom results object

This makes it possible to write a `__repr__` method on that object
that prints a statistical summary of the top 10 or so candidate
parameterizations.
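
Just to make this concrete, here is a quick sketch of what I have in
mind (class and attribute names are made up, nothing more than an
illustration):

```python
import numpy as np


class SearchResults(object):
    """Hypothetical container: one entry per candidate in each array."""

    def __init__(self, parameters, mean_validation_scores):
        self.parameters = list(parameters)                 # list of dicts
        self.mean_validation_scores = np.asarray(mean_validation_scores)

    def __repr__(self):
        # summarize the top 10 candidates by mean validation score
        order = np.argsort(self.mean_validation_scores)[::-1][:10]
        lines = ['%0.3f  %r' % (self.mean_validation_scores[i],
                                self.parameters[i]) for i in order]
        return 'Top candidates (mean validation score):\n' + '\n'.join(lines)


print(SearchResults([{'alpha': 1e-3}, {'alpha': 1e-1}], [0.91, 0.74]))
```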

I think we should keep `best_param_`, `best_estimator_` and
`best_score_` as quick-access convenience accessors, even if they are
redundant with the detailed content of the search results.

However, to move the discussion forward on the model evaluation
results, there are three additional use-cases not addressed by the
current design that I would like to see addressed somehow at some
point in the future:

A- Fault tolerance and handling missing results caused by evaluation errors

How to handle partial results? Sometimes some combinations of the
parameters will trigger runtime errors, for instance when the
evaluation raises an exception because the estimator fails to converge
(ill-conditioning), hits a numeric overflow / underflow (apparently
this can happen in our SGD cython code and raises a ValueError, to be
debugged), or runs out of memory...

I think the whole search should not crash if one evaluation fails
after 3 hours of computation and many successful evaluations. The
error should be collected and the evaluation iteration should be
excluded from the final results statistics.
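
To make the idea concrete, here is a rough sketch of a fault-tolerant
evaluation loop (not the actual grid search code; the imports follow a
recent scikit-learn layout and may differ):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold, ParameterGrid

X, y = make_classification(n_samples=300, random_state=0)  # stand-in dataset
param_grid = ParameterGrid({'alpha': [1e-4, 1e-2], 'penalty': ['l2', 'l1']})
cv = KFold(n_splits=3)

evaluations, errors = [], []
for params_id, params in enumerate(param_grid):
    for fold_id, (train, test) in enumerate(cv.split(X, y)):
        try:
            est = SGDClassifier(max_iter=20, **params)
            est.fit(X[train], y[train])
            score = est.score(X[test], y[test])
        except Exception as exc:  # non-convergence, ValueError, MemoryError...
            errors.append({'params': params, 'fold': fold_id,
                           'error': repr(exc)})
            continue  # skip this evaluation, keep the search alive
        evaluations.append({'params': params, 'fold': fold_id,
                            'score': score})

# `errors` can be reported to the user; `evaluations` feeds the result stats.
```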

B- Being able to monitor partial results and interrupt the search
without waiting for the end (e.g. by handling KeyboardInterrupt using
an async job scheduling API)

Also, even if the current joblib API does not allow for it, I think it
would be very useful to make it possible at some point for the user to
monitor the current progress of the search and to interrupt it without
losing access to the evaluation results collected up to that point.
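
Something along these lines, sketched with the current synchronous
execution model (`tasks` and `evaluate_one` are hypothetical
placeholders):

```python
def run_search(tasks, evaluate_one):
    """Run evaluations until done or until the user hits Ctrl-C.

    `tasks` yields (params, fold_id, train, test) tuples and
    `evaluate_one` returns a validation score; both are hypothetical
    placeholders for whatever scheduling API is used.
    """
    evaluations = []
    try:
        for params, fold_id, train, test in tasks:
            score = evaluate_one(params, train, test)
            evaluations.append({'params': params, 'fold': fold_id,
                                'score': score})
    except KeyboardInterrupt:
        pass  # stop scheduling new evaluations but keep partial results
    return evaluations
```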

C- Being able to warm-start a search with previously collected results

C1: Refining the search space: submit a new grid or parameter sampler
that focuses the search at a finer scale around an interesting area of
the existing dimensions, and optionally trim dimensions that the user
deems useless according to the past results.
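
For instance (purely illustrative, assuming the coarse pass pointed at
alpha values around 1e-3 and showed that the penalty choice does not
matter; module paths follow a recent scikit-learn release):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# coarse, first-pass grid
coarse_grid = ParameterGrid({'alpha': np.logspace(-6, 0, 7),
                             'penalty': ['l1', 'l2']})

# refined grid: zoom around the promising area, drop the useless dimension
fine_grid = ParameterGrid({'alpha': np.logspace(-4, -2, 9)})
```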

C2: Refining the cross-validation: the user might want to perform a
first search with a very low number of CV iterations (e.g. 1 or 2
iterations of a shuffle split) to get a coarse overview of the
interesting part of the search space, then trim the parameter grid to
a smaller yet promising grid and add more CV iterations only for those
parameters, so as to get finer estimates of the mean validation scores
by reducing the standard error of the mean across random CV folds.

Note: C2 is only useful for (Stratified)ShuffleSplit cross-validation,
where you can grow n_iter or change random_state to get as many CV
splits as you want, provided the dataset is large enough.
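
A quick sketch of that workflow (class and parameter names follow a
recent scikit-learn release and may differ from the n_iter-based API
mentioned above):

```python
from sklearn.model_selection import ShuffleSplit

# coarse first pass: only a couple of random splits per candidate
coarse_cv = ShuffleSplit(n_splits=2, test_size=0.2, random_state=0)

# second pass, on the trimmed grid only: more random splits to shrink
# the standard error of the mean validation score
refine_cv = ShuffleSplit(n_splits=8, test_size=0.2, random_state=1)
```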

In order to be able to address A, B and C in the future, I think the
estimator object should adopt a simple primary data structure: a
growable list of individual (parameter, CV-fold)-scoped evaluations.
It should then provide the user with methods to easily introspect
them, for instance to find the top 10 parameters by average validation
score across the currently available CV folds (some CV folds could be
missing due to partial evaluation caused by A (failures) or B
(interrupted computation)).

Each item in this list could have:

- parameters_id: unique parameter set integer identifier (e.g. a deep
hash or random index)
- parameters: the parameter settings dict
- cv_id: unique CV object integer identifier (hash of the CV object or
random index)
- cv_iter_index: the CV fold iteration integer index
- validation_score_name: the primary validation score (to be used for
ranking models)

Optional attributes we could add in the future:

- training score to be able to estimate under-fitting (if non-zero)
and over-fitting by diffing with the validation score
- more training and validation scores (e.g. precision, recall, AUC...)
- more evaluation metrics that are not scores but are useful for model
analysis (e.g. a confusion matrix for classification)
- fitting time
- prediction time (could be complicated to separate out of the
complete scoring time due to our Scorer API, which currently hides it).
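
Put together, a single evaluation record could look like this (just a
sketch with illustrative values; I store the primary score as a
name / value pair here and show the optional fields inline):

```python
evaluation = {
    # required fields
    'parameters_id': 42,                  # deep hash or random index
    'parameters': {'alpha': 1e-3, 'penalty': 'l2'},
    'cv_id': 7,                           # hash of the CV object or random index
    'cv_iter_index': 3,                   # which fold of that CV object
    'validation_score_name': 'accuracy',  # name of the primary score
    'validation_score': 0.87,             # primary score used for ranking
    # optional extensions
    'training_score': 0.93,
    'fitting_time': 2.4,                  # seconds
}
```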

Then to compute the mean score for a given parameter set, one could
group by parameters_id (e.g. using a Python `defaultdict(list)` with
parameters_id as key).
Advanced users could also convert this log of evaluations to a pandas
dataframe and then do joins / group-bys themselves to compute various
aggregate statistics across the dimensions of their choice.
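
A minimal sketch of both approaches, on a couple of illustrative
records following the structure above:

```python
from collections import defaultdict

# illustrative fold-level records (only the fields needed here are shown)
evaluations = [
    {'parameters_id': 0, 'cv_iter_index': 0, 'validation_score': 0.84},
    {'parameters_id': 0, 'cv_iter_index': 1, 'validation_score': 0.88},
    {'parameters_id': 1, 'cv_iter_index': 0, 'validation_score': 0.71},
]

# group fold-level scores by parameter set, tolerating missing folds
scores_by_params = defaultdict(list)
for ev in evaluations:
    scores_by_params[ev['parameters_id']].append(ev['validation_score'])

mean_scores = {pid: sum(scores) / len(scores)
               for pid, scores in scores_by_params.items()}
top_10 = sorted(mean_scores, key=mean_scores.get, reverse=True)[:10]

# pandas equivalent for advanced users:
# import pandas as pd
# pd.DataFrame(evaluations).groupby('parameters_id')['validation_score'].mean()
```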

Finally, there is an additional use case that I have in mind, even if
it is possibly less of a priority than the others:

D: warm starting with larger subsamples of the dataset

Make it possible to start the search on a small subsample of the
dataset (e.g. 10% of the complete dataset), then continue with a
larger subset (e.g. 20% of the dataset), to be able to identify the
most promising parameterizations quickly and evaluate how sensitive
they are to a doubling of the dataset size. That would make it
possible to select a smaller grid for a parameter search on the full
dataset, and also to compute learning curves for bias-variance
analysis of the individual parameters.
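
A rough sketch of D using plain random subsampling (no stratification;
the dataset and the search routine are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in dataset

rng = np.random.RandomState(0)
permutation = rng.permutation(X.shape[0])

for fraction in (0.1, 0.2):
    subset = permutation[:int(fraction * X.shape[0])]
    X_sub, y_sub = X[subset], y[subset]
    # run the (warm-started) parameter search on (X_sub, y_sub) and log
    # the results together with `fraction` so that learning curves can
    # be plotted per parameterization later on, e.g.:
    # results_by_fraction[fraction] = run_search(X_sub, y_sub)
```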

--
Olivier
