TL;DR: the parameter search results data structure choice should anticipate new use-cases
Thanks Joel for the detailed analysis. In the current situation, I think I myself like:

  5. many attributes, each an array, on a custom results object

This makes it possible to write a `__repr__` method on that object that could print a statistical summary of the top 10 or so candidate parameterizations.

I think we should keep `best_params_`, `best_estimator_` and `best_score_` as quick-access convenience accessors even if they are redundant with the detailed content of the search results.

However, to move the discussion forward on the model evaluation results, there are three additional use-cases not addressed by the current design that I would like to see addressed somehow at some point in the future:

A- Fault tolerance and handling missing results caused by evaluation errors

How to handle partial results? Sometimes some combinations of the parameters will trigger runtime errors, for instance if the evaluation raises an exception because the estimator fails to converge (ill-conditioning), because of a numeric overflow / underflow (apparently this can happen in our SGD Cython code and raises a ValueError, to be debugged), or because of a memory error... I think the whole search should not crash if one evaluation fails after 3 hours of computation and many successful evaluations. The error should be collected and the evaluation iteration should be excluded from the final results statistics.

B- Being able to monitor partial results and interrupt the search without waiting for the end (e.g. by handling KeyboardInterrupt using an async job scheduling API)

Also, even if the current joblib API does not allow for that, I think it would be very useful to make it possible at some point for the user to monitor the current progress of the search and to interrupt it without losing access to the evaluation results collected up to that point.

C- Being able to warm-start a search with previously collected results

C1: Refining the search space: submit a new grid or parameter sampler that focuses the search at a finer scale around an interesting area in existing dimensions, and optionally trims dimensions that are deemed useless by the user according to the past results.

C2: Refining the cross-validation: the user might want to perform a first search with a very low number of CV iterations (e.g. 1 or 2 iterations of shuffle split) to get a coarse overview of the interesting part of the search space, then trim the parameter grid to a smaller yet promising grid and add more CV iterations only for those parameters, so as to get finer estimates of the mean validation scores by reducing the standard error of the mean across random CV folds.

Note: C2 is only useful for the (Stratified)ShuffleSplit cross-validation, where you can grow n_iter or change random_state to get as many CV splits as you want, provided the dataset is large enough.

In order to be able to address A, B and C in the future, I think the estimator object should adopt a simple primary data structure that is a growable list of individual (parameter, CV-fold)-scoped evaluations, and then provide the user with methods to easily introspect them, such as finding the top 10 parameters by average validation score across the currently available CV folds (some CV folds could be missing due to a partial evaluation caused by A (failures) or B (interrupted computation)).

Each item in this list could have:

- parameters_id: a unique parameter set integer identifier (e.g. a deep hash or random index)
- parameters: the parameter settings dict
- cv_id: a unique CV object integer identifier (hash of the CV object or random index)
- cv_iter_index: the CV fold iteration integer index
- validation_score: the primary validation score (to be used for ranking models)

Optional attributes we could add in the future:

- the training score, to be able to estimate under-fitting (if non-zero) and over-fitting by diffing with the validation score
- more training and validation scores (e.g. precision, recall, AUC...)
- more evaluation metrics that are not scores but are useful for model analysis (e.g. a confusion matrix for classification)
- fitting time
- prediction time (could be complicated to separate out of the complete scoring time due to our Scorer API that currently hides it)

Then, to compute the mean score for a given parameter set, one could group by parameters_id (e.g. using a Python `defaultdict(list)` with parameters_id as key). Advanced users could also convert this log of evaluations into a pandas dataframe and then do joins / group-bys themselves to compute various aggregate statistics across the dimensions of their choice.
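To make the grouping step more concrete, here is a minimal sketch assuming the log is a plain list of dicts with the fields listed above (variable and field names such as `evaluation_log` are only illustrative, nothing here is a settled API):

    from collections import defaultdict

    import numpy as np

    # Hypothetical flat log of (parameter set, CV fold)-scoped evaluations.
    # A failed or interrupted evaluation simply has no entry in the log
    # (use cases A and B) instead of aborting the whole search.
    evaluation_log = [
        {'parameters_id': 0, 'parameters': {'C': 1.0}, 'cv_id': 42,
         'cv_iter_index': 0, 'validation_score': 0.81},
        {'parameters_id': 0, 'parameters': {'C': 1.0}, 'cv_id': 42,
         'cv_iter_index': 1, 'validation_score': 0.79},
        {'parameters_id': 1, 'parameters': {'C': 10.0}, 'cv_id': 42,
         'cv_iter_index': 0, 'validation_score': 0.85},
        # the (C=10.0, fold 1) evaluation raised an error and was excluded
    ]

    # Group the validation scores by parameters_id across available folds.
    scores_by_params = defaultdict(list)
    params_by_id = {}
    for record in evaluation_log:
        scores_by_params[record['parameters_id']].append(
            record['validation_score'])
        params_by_id[record['parameters_id']] = record['parameters']

    # Rank parameter sets by mean validation score over whatever folds
    # completed; this is the kind of summary the __repr__ of the results
    # object could print for the top 10 candidates.
    ranked = sorted(scores_by_params.items(),
                    key=lambda item: np.mean(item[1]), reverse=True)
    for parameters_id, scores in ranked[:10]:
        print("%0.3f (+/- %0.3f, n_folds=%d) for %r" % (
            np.mean(scores), np.std(scores), len(scores),
            params_by_id[parameters_id]))

pandas users would get the same aggregate with something like `DataFrame(evaluation_log).groupby('parameters_id')['validation_score'].mean()`.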
Finally, there is an additional use case that I have in mind, even if it is possibly less of a priority than the others:

D: warm-starting with larger subsamples of the dataset

Make it possible to start the search on a small subsample of the dataset (e.g. 10% of the complete dataset), then on a larger subset (e.g. 20% of the dataset), so as to identify the most promising parameterizations quickly and evaluate how sensitive they are to a doubling of the dataset size. That would make it possible to select a smaller grid for a parameter search on the full dataset, and also to compute learning curves for a bias-variance analysis of the individual parameters.

--
Olivier