On Sat, Oct 29, 2011 at 12:33:00AM +0900, Mathieu Blondel wrote:
> On Sat, Oct 29, 2011 at 12:18 AM, Olivier Grisel
> <[email protected]> wrote:
> > percent_val would be a constructor param in that case as it's not
> > data-dependent.
> Good point!
> > I am +1 for X_val=None, y_val=None in fit for the GridSearchCV class
> Or maybe a new object GridSearchValidation, as the semantics are a bit
> different? (having X_val / y_val parameters seems incompatible with
> having a cv generator)
Sorry for being dense, but I am still not getting the use case. How is a
validation set different from a test set? And how is this different from
having a cv generator of length 1?
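To make the "cv of length 1" point concrete, here is a minimal sketch (not from the thread; it uses today's scikit-learn import paths rather than the 2011 ones, and the toy data is made up): an iterable containing a single (train_indices, test_indices) pair passed as `cv` makes GridSearchCV score every candidate on one fixed validation set.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Toy data: first 80 samples play the role of the training set,
# the last 20 the role of a predefined validation set.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = (X[:, 0] > 0).astype(int)

# A cv iterable with exactly one (train, test) index pair: grid search
# then evaluates each parameter setting on that single fixed split.
cv = [(np.arange(80), np.arange(80, 100))]

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=cv)
search.fit(X, y)
print(search.best_params_)
```

In other words, a predefined validation set is just the degenerate case of a cv generator that yields one split.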
I guess that one difference lies in the copying that happens when
concatenating the datasets. If this is really the problem, let me share a
few tricks that I use a lot on big datasets.
I sometimes have a list of arrays, rather than an actual array, that I
pass to the GridSearch object. Right now nothing in the specs guarantees
that the GridSearch or other cross-validation utilities will not break on
this. I would be open to making this part of the contract, as it is very
useful. Furthermore, nothing guarantees that the estimator will not do
something stupid on a list of arrays. In the specific use case of one
predefined train and test set, it is fine, as the following data and cv
objects should solve the problem (not tested):
X = [X_train, X_test]
y = [y_train, y_test]
cv = [[0, 1]]
In the case of more complex patterns, it means that you need a
transformer that concatenates arrays on the fly, but that's really easy
to write.
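Such a concatenating transformer could look like the sketch below (my own illustration, not an existing scikit-learn class; `ConcatArrays` is a hypothetical name): it accepts a sequence of 2D array chunks and stacks them into one array, so downstream steps in a Pipeline see a plain array.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ConcatArrays(BaseEstimator, TransformerMixin):
    """Stack a sequence of 2D array chunks into one array on the fly.

    Hypothetical helper: upstream, cross-validation indexes into a list
    of array chunks; this step materializes the concatenated array only
    when the data actually flows through the pipeline.
    """

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, X):
        # X is a sequence of 2D arrays sharing the same number of columns.
        return np.vstack(list(X))


# Usage: two chunks of 2 and 1 samples become a single (3, 3) array.
chunks = [np.ones((2, 3)), np.zeros((1, 3))]
stacked = ConcatArrays().fit_transform(chunks)
print(stacked.shape)  # (3, 3)
```

The concatenation then happens lazily inside the pipeline instead of up front, which avoids holding both the chunks and the concatenated copy for the whole grid search.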
Which brings me to another pattern that I have been using a lot in the
case of large data. I sometimes use as the X data (which is usually the
heavy part of the data) only keys into a database, or filenames. I then
have a transformer that converts lists of keys to arrays by fetching the
data on the fly. With a good database that does caching, I find this both
efficient and memory-friendly.
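A sketch of that pattern (again my own illustration with hypothetical names, not a scikit-learn API): `store` is anything with `__getitem__` mapping a key to a feature vector, be it a dict, a shelve file, or a caching database wrapper. X stays a cheap list of keys until the transformer resolves them.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class KeyToArray(BaseEstimator, TransformerMixin):
    """Resolve lightweight keys to heavy feature arrays on the fly.

    Hypothetical helper: `store` maps a key (database id, filename, ...)
    to a 1D feature vector. Cross-validation slices the list of keys,
    which is cheap; the heavy data is only fetched at transform time.
    """

    def __init__(self, store):
        self.store = store

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self

    def transform(self, X):
        # X is a list of keys; materialize the feature matrix now.
        return np.vstack([self.store[key] for key in X])


# Usage, with an in-memory dict standing in for the database:
store = {"a": np.array([0.0, 1.0]), "b": np.array([2.0, 3.0])}
features = KeyToArray(store).fit_transform(["a", "b"])
print(features.shape)  # (2, 2)
```

Put first in a Pipeline, this keeps GridSearch manipulating only keys while every downstream estimator still receives a proper array.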
Now, this is obviously a bit hard to guess. In my opinion this brings us
again to a documentation problem. We need to enrich the model-selection
chapter. We might also need a subchapter on large-scale learning that
discusses such tricks.
This discussion is full of useful tips and tricks. We need to pickle it
to the documentation at some point.
> Note that this would be a generic solution: what I was proposing is an
> API to take advantage of the problem specificities to make efficient
> use of the validation set.
Could you give an example of what you have in mind? I am probably just
revealing my lack of knowledge here.
Cheers,
Gaƫl
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general