On Sat, Oct 29, 2011 at 12:33:00AM +0900, Mathieu Blondel wrote:
> On Sat, Oct 29, 2011 at 12:18 AM, Olivier Grisel
> <[email protected]> wrote:

> > percent_val would be a constructor param in that case as it's not
> > data-dependent.

> Good point!

> > I am +1 for X_val=None, y_val=None in fit for the GridSearchCV class

> Or maybe a new object GridSearchValidation, as the semantics are a bit
> different? (having X_val / y_val parameters seems incompatible with
> having a cv generator)

Sorry for being dense, but I am still not getting the use case. How is a
validation set different from a test set? And how is this different from
having a cv generator of length 1?

I guess that one difference lies in the copying that happens when
concatenating the datasets. If this is really the problem, let me share
some tricks that I use a lot on big datasets.

I sometimes have a list of arrays, rather than an actual array, that I
pass to the GridSearch object. Right now nothing in the specs guarantees
that GridSearch or the other cross-validation utilities will not break on
this. I would be open to making this part of the contract, as it is very
useful. Furthermore, nothing guarantees that the estimator will not do
something stupid on a list of arrays. In the specific use case of one
predefined train and test set, it is fine, as the following data and cv
objects should solve the problem (not tested):

    X  = [X_train, X_test]   # a list of two arrays, not one big array
    y  = [y_train, y_test]
    cv = [[0, 1]]            # one split: train on entry 0, test on entry 1
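
To make the semantics explicit, here is conceptually what a
cross-validation loop does with these objects (a toy sketch with made-up
data; the actual GridSearch internals may index differently):

    import numpy as np
    from sklearn.svm import SVC

    # stand-ins for a real, predefined split
    X_train, y_train = np.random.randn(80, 5), np.random.randint(0, 2, 80)
    X_test,  y_test  = np.random.randn(20, 5), np.random.randint(0, 2, 20)

    X  = [X_train, X_test]
    y  = [y_train, y_test]
    cv = [[0, 1]]

    # each element of cv is a (train, test) pair of positions in the lists
    for train, test in cv:
        clf = SVC().fit(X[train], y[train])
        print(clf.score(X[test], y[test]))   # scored on X_test only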

In the case of more complex patterns, you need a transformer that
concatenates arrays on the fly, but that's really easy to write.
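
For instance (a minimal sketch; the class name is made up, and nothing
here is specific to scikit-learn beyond the fit/transform convention):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    class ConcatArrays(BaseEstimator, TransformerMixin):
        """Stateless transformer turning a list of 2D arrays into one."""

        def fit(self, X, y=None):
            # nothing to learn
            return self

        def transform(self, X):
            # X is a sequence of arrays sharing the same number of
            # columns; the copy happens here, lazily, per fold
            return np.concatenate(X, axis=0)

Put at the head of a Pipeline, the concatenation then only happens on
the sub-list of arrays selected for a given fold, rather than on the
full dataset upfront.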

Which brings me to another pattern that I have been using a lot in the
case of large data. I sometimes use as the X data (which is usually the
heavy part of the data) only keys into a database, or filenames. I then
have a transformer that converts the list of keys to arrays by fetching
the data on the fly. With a good database that does caching, I find this
both efficient and memory-friendly.
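
Schematically (again a made-up sketch: the class name is invented, and
'fetch' stands for whatever your database or filesystem lookup is):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    class KeysToArray(BaseEstimator, TransformerMixin):
        """Replace a list of keys by the corresponding feature vectors."""

        def __init__(self, fetch):
            # fetch maps one key (a filename, a database id, ...) to a
            # 1D feature array; ideally it caches
            self.fetch = fetch

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            # X is a list of keys: the heavy data is loaded only here
            return np.vstack([self.fetch(key) for key in X])

    # e.g. with .npy files on disk, X is then just a list of filenames:
    model = Pipeline([('load', KeysToArray(fetch=np.load)),
                      ('clf', SVC())])
    # model.fit(filenames, y)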

Now, this is obviously a bit hard to guess. In my opinion this brings us
once again to a documentation problem. We need to enrich the
model-selection chapter. We might also need a subchapter on large-scale
learning that discusses such tricks.

This discussion is full of useful tips and tricks. We need to pickle it
into the documentation at some point.

> Note that this would be a generic solution: what I was proposing is an
> API to take advantage of the problem specificities to make efficient
> use of the validation set.

Could you give an example of what you have in mind? I am probably just
revealing my lack of knowledge here.

Cheers,

Gaël
