2013/6/9 Joel Nothman <jnoth...@student.usyd.edu.au>:
> Thanks, Olivier. Those are some interesting use-cases:
>
>> A- Fault tolerance and handling missing results caused by evaluation
>> errors
>
> I don't think this affects the output format, except where we can actually
> get partial results for a fold, or if we want to report successful folds
> and ignore others for a single candidate parameter setting. But I wonder
> if that just makes things much too complicated.

It's not complicated to store successful results in a list and failed
parameters + matching error tracebacks in another. The log of successful
evaluations could either be a list of dicts or a list of namedtuples. The
list of dicts option is probably more flexible if we want to make it
possible for the user to collect additional evaluation attributes, by
passing a callback for instance.
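For instance, something along these lines (just a sketch; the names and
fields are made up, this is not an existing scikit-learn API):

    import traceback

    # Hypothetical raw logs: one plain list for successful evaluations
    # (one dict per parameters x fold combination) and one for failures
    # (parameters + formatted traceback of the error).
    successful_evaluations = []
    failed_evaluations = []

    def log_one_evaluation(estimator, parameters, X_train, y_train,
                           X_test, y_test, fold_id):
        """Fit one candidate on one CV fold and append the outcome."""
        try:
            estimator.set_params(**parameters).fit(X_train, y_train)
            successful_evaluations.append({
                'parameters': parameters,
                'fold_id': fold_id,
                'test_score': estimator.score(X_test, y_test),
                # more keys (train_time, model diagnostics, subsample
                # size...) can be added later without changing the
                # data structure
            })
        except Exception:
            failed_evaluations.append({
                'parameters': parameters,
                'fold_id': fold_id,
                'traceback': traceback.format_exc(),
            })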
>> B: Being able to monitor partial results and interrupt search before
>> waiting for the end (e.g. by handling KeyboardInterrupt using an async
>> job scheduling API)
>
> So the stop and resume case just means the results need to be appendable...?

Yes, mostly. But that also means that we should be able to compute mean
scores over 2 out of 5 folds and then recompute the mean scores later when
we get access to all 5 fold results. Hence my proposal to store the raw
evaluations in a plain list and offer public methods to compute aggregated,
user-friendly summaries of the partial or complete results.

> In general, I don't think Parallel's returning a list is of great benefit
> here. Working with an iterable would be more comfortable.

Yes, we might need to make joblib.Parallel evolve to support task
submission and async retrieval to implement this. I think this is one of
the possible design goals envisioned by Gael as a possible evolution of
the joblib project.

>> C1: Refining the search space
>
> Similarly, it should be possible to have fit append further results.

Yes.

>> C2: Refining the cross-validation
> and
>> D: warm starting with larger subsamples of the dataset
>
> I would think in these cases it's better to create a new estimator and/or
> keep results separate.

Although I think those two are very important to manage the exploration /
exploitation trade-off faced by ML researchers and practitioners, I also
agree they could be addressed in later evolutions of scikit-learn, or maybe
even as separate projects such as https://github.com/jaberg/hyperopt or
https://github.com/pydata/pyrallel

I would just like to emphasize that storing the raw evaluation logs as a
plain python list would make it possible to deal with this kind of future
evolution if we ever decide to implement these use cases directly in
scikit-learn. Hence I think that the data structure that stores the
evaluation results should be as simple as possible and avoid making any
assumptions on the kind of aggregations or the number of axes we will
collect during the search. Basically, adding support for sub-sampling will
add a new axis for possible aggregations, and if we use 2D numpy rec-arrays
as the primary data structure with 1 row per parameter setting we won't be
able to implement that use case at all without breaking the API once again.

>> Optional attributes we could add in the future:
>
> Something you missed: the ability to get back diagnostics on the quality /
> complexity of the model, e.g. coefficient sparsity.

Yes. I think we could extend the fit_grid_point API to make it possible to
pass an arbitrary python callback that would have access to the fitted
estimator and the CV fold, and collect any kind of additional model
properties to be included in the search report.
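For instance, such a callback could look like this (hypothetical sketch;
fit_grid_point does not have such a hook today):

    import numpy as np

    def sparsity_callback(fitted_estimator, evaluation_record):
        """Hypothetical user callback called after each (parameters,
        fold) fit: it receives the fitted estimator and the dict about
        to be appended to the raw evaluation log, and can store any
        extra model diagnostic."""
        coef = getattr(fitted_estimator, 'coef_', None)
        if coef is not None:
            evaluation_record['coef_sparsity'] = float(np.mean(coef == 0))

Because the log entries are plain dicts, such extra keys require no schema
change and are naturally absent for models that do not expose coef_.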
> These suggestions do make me consider storage in an external database (a
> blob store, or an online spreadsheet) as hyperopt allows. I think "allows"
> is important here: when you get to that scale of experimentation, you
> probably don't want results logged only in memory. But we need a sensible
> default for working with a few thousand candidates.

I agree, but I think we should keep that for another thread.

> Except for purity of parallelism, I don't see why you would want to store
> each fold result for a single candidate separately. I don't see the
> use-case for providing them separately to the user (except where one fold
> failed and another succeeded).

To make it easy to:

- deal with partial / incomplete results (either for fault tolerance or
  early stopping / monitoring),
- extend the size of an existing dimension (e.g. collecting 5 random CV
  folds instead of 3) in a warm restart of the search,
- add a new dimension (e.g. subsamples of the dataset), possibly in a warm
  restart of the search instance,

by not making any assumptions on the kind of estimates the user will want
in a future version of the lib.

> As far as I'm concerned, the frontend should hide that.

Yes, that's why I propose to provide public methods to compute interesting
aggregates from the raw evaluation log.

> I do see that providing all fields together for a single candidate is the
> most common use-case and argues against providing parallel arrays (but not
> against a structured array / recarray).

Structured arrays / recarrays have 2 issues:

- they handle missing / partial results badly, or at least there is no
  uniform solution, since the missing data marker depends on the dtype of
  the column (NaN for floats, -1 as a marker for ints, None for
  dtype=object?); furthermore, slots for missing results have to be
  pre-allocated;
- they do not naturally handle changes in dimension sizes or in the number
  of dimensions.

> Finally, the single most important thing I can see about making results
> explorable is not providing candidate parameter settings only as dicts,
> but splitting the dicts out so that you can query by the value of each
> parameter, and group over others.

Yes, but if we go for the simple evaluation log list I propose, this can
always be provided by dedicated methods (see the sketch below). Furthermore,
be aware that the number of parameters is not always the same for each
result item of a GridSearchCV. See:

http://scikit-learn.org/stable/modules/grid_search.html#gridsearchcv

This is a valid param grid:

    param_grid = [
        {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
        {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001],
         'kernel': ['rbf']},
    ]

The gamma parameter is only present when `kernel == 'rbf'`. Expanding this
into the columns of a rec-array is not very natural I think. This is
similar to the sparsity issue mentioned earlier.
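As an illustration of such a dedicated method (again a hypothetical sketch
working on the raw list of dicts described earlier), grouping by the value
of one parameter can simply skip the entries that do not define it:

    from collections import defaultdict

    def mean_test_score_by_param(evaluation_log, param_name):
        """Average test scores grouped by the value of one parameter.

        Entries whose parameter dict does not define param_name (e.g.
        'gamma' with a linear kernel) are skipped, and partial results
        are averaged over whatever folds are available so far."""
        scores_by_value = defaultdict(list)
        for record in evaluation_log:
            if param_name in record['parameters']:
                scores_by_value[record['parameters'][param_name]].append(
                    record['test_score'])
        return dict((value, sum(scores) / len(scores))
                    for value, scores in scores_by_value.items())

    # e.g.: mean_test_score_by_param(successful_evaluations, 'gamma')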
> This may be getting into crazy land, and certainly close to reimplementing
> Pandas for the 2d case, or recarrays with benefits, but: imagine we had a
> SearchResult object with:
> * attributes like fold_test_score, fold_train_score, fold_train_time,
> each a 2d array.
> * __getattr__ magic that produced mean_test_score, mean_train_time, etc.
> and std_test_score, std_train_time on demand (weighted by some
> samples_per_fold attr if iid=True).
> * attributes like param_C that would enable selecting certain candidates
> by their parameter settings (through numpy-style boolean queries).
> * __getitem__ that can pull out one or more candidates by index (and
> returns a SearchResult).
> * a method that returns a dict of selected 1d array attributes for
> Pandas-style (or spreadsheet? in that case a list of dicts) integration
> * a method that zips over selected attributes for simple iteration.
>
> Is this crazy, or does it do exactly what we want? or both? And how does
> it not meet the needs of your wishlist, Olivier (except where the number
> of folds differ)?

Interesting, but I am not sure I understand it all. Can you give an example
of a typical series of instructions that would leverage such a SearchResult
object from an interactive python session to introspect it?

Furthermore, such a SearchResult instance could always be computed on
demand, or at the end of the computation, from the raw evaluation log, or
even wrap the raw evaluation log internally.

Basically I am advocating Event Sourcing [1] as a design goal for the
primary data structure used to store the evaluation results. Let us make as
few assumptions as possible on the kind of data we want to collect and on
how the user will aggregate those data to find the best models.

[1] http://martinfowler.com/eaaDev/EventSourcing.html

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel