TL;DR: a list of `namedtuple`s is a poor solution for parameter search
results; here I suggest better alternatives.

I would like to draw some attention to #1787, which proposes that structured
arrays be used to return parameter search (e.g. GridSearchCV) results. A
few proposals have sought additional parameter search outputs, e.g.
training scores and times in #1742, multiple test metrics (such as P and R
where F1 is the objective), or per-class performance. Structured arrays may
not be the right answer, but some solution should be selected.

In scikit-learn 0.13, results are a list of triples (parameters, mean
score, fold scores). Using tuples, or `namedtuple`s as in the current dev
version, is a particularly poor solution:

* it is not extensible: people will expect it to have a fixed length (e.g.
when unpacking), and changes in namedtuple length break unpickling.
* it doesn't look like the output of other estimators, afaik.
* it is not especially convenient to access.
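
To make the extensibility point concrete, here is a minimal sketch. The
record types `ResultV1` and `ResultV2` and the extra `fit_time` field are
hypothetical names invented for illustration, not the actual scikit-learn
classes. It shows user code written against the triple failing as soon as a
field is added.

```python
from collections import namedtuple

# Hypothetical record modelled on the (parameters, mean score, fold scores) triple.
ResultV1 = namedtuple("Result", ["parameters", "mean_score", "fold_scores"])
# Hypothetical later version of the same record with one extra field.
ResultV2 = namedtuple(
    "Result", ["parameters", "mean_score", "fold_scores", "fit_time"]
)

old = ResultV1({"C": 1.0}, 0.9, [0.88, 0.92])
new = ResultV2({"C": 1.0}, 0.9, [0.88, 0.92], 0.05)

# User code written against the triple works with the old record...
params, mean, folds = old

# ...but the same unpacking fails once the record grows a field.
try:
    params, mean, folds = new
    unpack_error = None
except ValueError as exc:
    unpack_error = exc

print(type(unpack_error).__name__)
```

Access by name (`new.fit_time`) keeps working, which is why a named but
non-iterable container avoids this particular problem.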

We need a format that can support more fields. As far as I can see this
means one of:

1. a sequence of dicts
2. a sequence of namespaces (like `namedtuple`s but not iterable)
3. a dict of arrays
4. many attributes, each an array, on the estimator
5. many attributes, each an array, on a custom results object
6. a structured array / recarray
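
As a rough illustration of how (1) and (3) relate, the sketch below holds
the same two candidates first as a sequence of dicts and then as a dict of
co-indexed columns. The field names and values are made up for the example,
and plain lists stand in for arrays.

```python
# Option (1): a sequence of dicts, one per search candidate.
rows = [
    {"parameters": {"C": 1.0}, "mean_score": 0.80, "fold_scores": [0.78, 0.82]},
    {"parameters": {"C": 10.0}, "mean_score": 0.90, "fold_scores": [0.88, 0.92]},
]

# Option (3): a dict of columns, each indexed by candidate.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Converting back is equally mechanical, so the two carry the same data.
rows_again = [dict(zip(columns, values)) for values in zip(*columns.values())]

print(columns["mean_score"])
```

The difference is mostly in which access pattern is cheap: (1) favours
looking at one candidate at a time, (3) favours looking at one field across
all candidates.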

All of these require the fields to be named (something not discussed enough
at #1787). Except for (2), (4) and (5), where descriptors can be used to
deprecate names and transform values, those names and their values must
remain fixed across versions. I think (4) is most compatible with
scikit-learn's use of attributes and parameters (coindexed arrays are
common).
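
A sketch of the descriptor idea, assuming a hypothetical results class in
which an old attribute name `scores_` has been renamed to `mean_scores_`
(both names, and the class itself, are invented here): a property keeps the
old name working while warning about it.

```python
import warnings


class DeprecatingResults:
    """Hypothetical results holder; not an actual scikit-learn class."""

    def __init__(self, mean_scores):
        self.mean_scores_ = list(mean_scores)

    @property
    def scores_(self):
        # Old (hypothetical) name: still readable, but flagged as deprecated.
        warnings.warn(
            "scores_ is deprecated; use mean_scores_ instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.mean_scores_


results = DeprecatingResults([0.8, 0.9])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = results.scores_  # triggers the DeprecationWarning

print(legacy, len(caught))
```

Nothing comparable is available for a dict key or a structured array field
name, which is the asymmetry described above.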

Structured arrays are good because they can be accessed in all dimensions
(search candidates, folds where relevant, and fields). They are bad because
they are not familiar to scikit-learn users and can be quirky to work with
(particularly if some fields have `dtype=object`).
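
For readers who have not used them, here is a small sketch of what access
"in all dimensions" looks like. The field names, the number of folds and the
scores are assumptions made up for the example.

```python
import numpy as np

n_folds = 3  # assumed number of CV folds for this example

# One record per candidate; fold_scores is a sub-array field.
dtype = [
    ("parameters", object),
    ("mean_score", float),
    ("fold_scores", float, (n_folds,)),
]
results = np.array(
    [
        ({"C": 1.0}, 0.80, [0.78, 0.80, 0.82]),
        ({"C": 10.0}, 0.90, [0.88, 0.90, 0.92]),
    ],
    dtype=dtype,
)

by_field = results["mean_score"]            # one field, all candidates
by_candidate = results[1]                   # one candidate, all fields
first_fold = results["fold_scores"][:, 0]   # one fold, all candidates

print(by_field, by_candidate["parameters"], first_fold)
```

The `parameters` field is the `dtype=object` case: it stores arbitrary
Python dicts, so vectorised numpy operations on that field are limited.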

It seems the common use-case for this data is to select one or more
candidates by their parameter values, and then to explore a few fields,
such as scores or times, or their means or standard deviations. Structured
arrays (6) make this easy in some cases because slicing by index and
zipped iteration over selected fields are included out of the box (but
`zip` is still needed to mix per-candidate and per-fold data).
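
A minimal sketch of that use-case, assuming each searched parameter gets its
own field (the `C`, `kernel` and `fit_time` fields and all values are
invented for illustration):

```python
import numpy as np

results = np.array(
    [
        (1.0, "linear", 0.80, 0.10),
        (10.0, "linear", 0.85, 0.12),
        (10.0, "rbf", 0.90, 0.30),
    ],
    dtype=[
        ("C", float),
        ("kernel", "U10"),
        ("mean_score", float),
        ("fit_time", float),
    ],
)

# Select candidates by parameter value with a boolean mask.
selected = results[results["C"] == 10.0]

# Indexing with a list of field names yields the fields already "zipped".
pairs = selected[["mean_score", "fit_time"]].tolist()

for score, seconds in pairs:
    print(score, seconds)
```

This only works so directly because the parameters are plain typed fields;
with a single `dtype=object` field of parameter dicts, the selection step
would need an explicit Python loop.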

It could be possible to enable this sort of functionality with (4) or (5)
-- slicing over the search candidates; iterating fields in parallel;
aggregating over folds -- but this increases API complexity and reinvents
the wheel (*).
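
To give a feel for how much wheel would be reinvented, here is a rough
sketch of (5). `ColumnarResults` is a hypothetical class written for this
message, not a proposal for the final API; it supports slicing over
candidates and leaves aggregation over folds to numpy.

```python
import numpy as np


class ColumnarResults:
    """Hypothetical results object: one array per field, co-indexed by candidate."""

    def __init__(self, **fields):
        self._fields = {name: np.asarray(values) for name, values in fields.items()}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self.__dict__["_fields"][name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, index):
        # Apply the same slice or boolean mask to every field.
        return ColumnarResults(
            **{name: values[index] for name, values in self._fields.items()}
        )

    def __len__(self):
        return len(next(iter(self._fields.values())))


results = ColumnarResults(
    C=[1.0, 10.0, 100.0],
    mean_score=[0.80, 0.90, 0.85],
    fold_scores=[[0.78, 0.82], [0.88, 0.92], [0.84, 0.86]],
)

good = results[results.mean_score >= 0.85]    # slice over candidates
fold_std = results.fold_scores.std(axis=1)    # aggregate over folds

print(len(good), good.C, fold_std)
```

Even this small version has to decide how indexing, iteration and field
discovery behave, which is the API complexity I am worried about.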

So, please: consider the alternatives (**); and please don't lock in a list
of `namedtuple`s.

- Joel

(*) Essentially we're replicating `pandas.DataFrame` except that our
per-fold data is 2d and so doesn't fit into their `Series`. I guess having
a format that can be easily imported into a `DataFrame` (3, 6) has
advantages. See also #1034.
(**) My preferences are (4) for its simplicity, familiarity and
flexibility; and (6) because it can be easily transformed and uses an
appropriate, existing numpy data structure.