On Fri, Nov 04, 2011 at 04:33:39PM +0100, Andreas Müller wrote:
> Hey Frederic.
> > I don't have a good understanding of scikit.learn, but I think that
> > hyper-parameter selection is a hot research topic right now. How do
> > you plan to include this in scikit.learn's current fit-method
> > interface?
> >
> Depends on what you mean when you say hyperparameters.
> Things like the learning rate, weight decay, and size of the hidden
> layer can be cross-validated.

Cross-validating one or two hyperparameters is fine, but once you get into
the regime of 5-10 hyperparameters (initial learning rate, momentum,
annealing schedule, batch size, activation function, initialization
distributions...), grid search becomes quite costly, and yet tuning these
things can be essential if you want to even equal the performance of an SVM
(you can, of course, do things like randomly sample your hyperparameters, but
this requires a bit of domain expertise in determining what constitutes
a"reasonable" distribution to draw each one from).

Really one of the best ways of avoiding overfitting is to do early stopping,
but in order to do this properly in the context of cross-validation, you need
two held-out sets, one validation set for monitoring when to stop and one to
estimate your test error for this CV fold. The rabbit hole just gets deeper
from there, I'm afraid.
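
To make the two-held-out-sets idea concrete, here's a toy sketch of a single
fold (the model, the split sizes, and the patience value are all arbitrary
choices for illustration, not a recommendation):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Toy data; in a real CV loop this would be the current fold's data.
    rng = np.random.RandomState(0)
    X = rng.randn(1000, 20)
    y = (X[:, 0] + 0.5 * rng.randn(1000) > 0).astype(int)

    # Two held-out sets: one for early stopping, one for the fold's error.
    idx = rng.permutation(len(X))
    test_idx, valid_idx, train_idx = idx[:200], idx[200:400], idx[400:]

    clf = SGDClassifier(learning_rate='constant', eta0=0.01)
    best_err, best_epoch, patience = np.inf, 0, 5
    for epoch in range(100):
        clf.partial_fit(X[train_idx], y[train_idx], classes=[0, 1])
        err = 1.0 - clf.score(X[valid_idx], y[valid_idx])
        if err < best_err:
            best_err, best_epoch = err, epoch   # new best on validation
        elif epoch - best_epoch > patience:
            break                               # no recent improvement: stop

    # In practice you would restore the weights saved at best_epoch before
    # scoring once on the fold's test set.
    print("fold test error: %.3f" % (1.0 - clf.score(X[test_idx], y[test_idx])))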

> Of course there are many other possibilities like pretraining,
> deeper networks, different learning rate schedules, etc.
> You are right, this is somewhat of an active research field,
> though I have not seen conclusive evidence that any
> of these methods are consistently better than a vanilla MLP.

The table on page 7 of http://www.dumitru.ca/files/publications/icml_07.pdf
makes a pretty compelling case, I'd say.

Now, there are also the results out of Juergen Schmidhuber's lab showing that
if you train for months on a GPU, add all kinds of prior knowledge into the
preprocessing pipeline, and make careful choices about the learning rate
schedule, initialization, and activation function (some of this is pretty
easy and well-documented in the paper by Yann LeCun that Olivier sent around
earlier in the thread; other parts will take a lot of fiddling), then you
*can* make vanilla MLPs perform really well on MNIST. But this says more
about the devotion of the practitioners to this (rather artificial) task, and
the sorts of built-in prior knowledge they used, than it does about the
strength of the learning algorithm.

David
