On Fri, Mar 21, 2014 at 8:50 AM, Issam <issamo...@gmail.com> wrote:
> On 3/21/2014 4:25 PM, James Bergstra wrote:
>
> The proposal looks good to me! A few small comments:
>
> 1. I'm confused by the paragraph on regularized ELMs: I think you mean
> that in cases where the hidden weights (the classifier?) are
> *underdetermined* because there are far more *unknowns* than *samples*, then
> you need to regularize somehow. (Right!?)
>
>
>
> I meant the opposite :) - there are usually far more "samples" than
> "unknowns". The unknowns depend on the number of hidden neurons and output
> neurons which is usually small.
>
> Typically the hidden weights matrix (the weights going out of the hidden
> neurons to the output neuron) is a 150x1 matrix. In other words there are
> 150 hidden neurons and 1 output neuron. This means there are 150 unknown
> variables. Since least-squares solutions can be considered as systems of
> linear equations, solving for 150 unknown variables is possible with 150
> samples. But datasets are usually as large as 10,000 samples, meaning the
> number of unique solutions is very large as well, hence overdetermined
> (http://en.wikipedia.org/wiki/Overdetermined_system).
>
> Therefore, regularization would constrain the number of solutions by
> making sure they satisfy a meaningful constraint - like SVM's maximization
> of the margin between classes.
>
> Sorry that this wasn't clear in the proposal.
>
I get that if you have 10,000 samples and 150 features, then your system
is over-determined.
Where I think you go wrong is in worrying about a large number of unique
solutions. Over-determined typically means 0 solutions! (Have another look
at that page you linked, it's the under-determined systems that need
explicit regularization to find a unique solution.)
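To make the over/under-determined distinction concrete, here's a rough NumPy sketch (the 150-unknown shapes are borrowed from your example; the data is just random, not anything from the proposal):

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined: 10,000 samples, 150 unknowns. In general there is NO
# exact solution, but the least-squares solution is unique because the
# design matrix has full column rank.
X_over = rng.normal(size=(10_000, 150))
y_over = rng.normal(size=10_000)
w_over, residuals, rank, _ = np.linalg.lstsq(X_over, y_over, rcond=None)
print(rank)  # full column rank -> unique least-squares fit

# Underdetermined: 50 samples, 150 unknowns. Infinitely many exact
# solutions exist; lstsq picks the minimum-norm one, which is itself a
# form of implicit regularization.
X_under = rng.normal(size=(50, 150))
y_under = rng.normal(size=50)
w_under, *_ = np.linalg.lstsq(X_under, y_under, rcond=None)
print(np.allclose(X_under @ w_under, y_under))  # exact fit is possible
```

So it's the second (underdetermined) case where you need some extra criterion to pin down a single solution.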
SVM has max-margin regularization partly because of the shape of hinge
loss, not because of the number of samples. The hinge loss is a constant 0
for large parts of the input domain, so there isn't a single "best" point
on the loss function. Conventional regularizers like L1 and L2 push the
solution toward the kink of the hinge.
Anyway, ML in general deals with noisy data (both in classification and
numeric regression) so that's actually the dominant reason why
regularization is used, even when the system is technically overdetermined.
For your proposal, it would probably be more accurate to explain that when
training data is noisy, regularization during training can lead to more
accurate predictions on test data. That's why the regularized ELM is worth
implementing.
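For a quick illustration of that point with sklearn itself (synthetic data, and the alpha is just an arbitrary choice, not a recommendation): with noisy targets and nearly as many features as training samples, plain least squares overfits while a ridge penalty generalizes better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 200, 150
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:5] = 1.0                                 # only a few informative directions
y = X @ true_w + rng.normal(scale=2.0, size=n)   # heavy noise on the targets

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unregularized least squares nearly interpolates the noisy training set.
ols = LinearRegression().fit(X_tr, y_tr)
# Ridge shrinks the weights and fits the noise less.
ridge = Ridge(alpha=10.0).fit(X_tr, y_tr)

print(ols.score(X_te, y_te), ridge.score(X_te, y_te))
```

On held-out data the ridge model's R^2 comes out ahead, even though the full 200x150 system is technically overdetermined.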
Also: new topic: Did you mention earlier in the thread that you need
derivatives to implement a regularized ELM? Why don't you just use some of
the existing linear (or even non-linear?) regression models in sklearn to
classify the features computed by the initial layers of the ELM? This is a
more detailed question that doesn't really affect your proposal, but I'd
like to hear your thoughts and maybe discuss it.
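What I have in mind is something like the following sketch: a fixed random hidden layer, with any existing sklearn linear model as the trained output layer. (This is my own toy version, not code from your proposal; the sizes and the tanh nonlinearity are arbitrary choices.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# ELM-style hidden layer: random, untrained projection + nonlinearity.
n_hidden = 150
W = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))
b = rng.normal(size=n_hidden)
H_tr = np.tanh(X_tr @ W + b)
H_te = np.tanh(X_te @ W + b)

# Any (regularized) linear model from sklearn can act as the output
# layer; no derivatives of the hidden layer are ever needed, since
# only the output weights are learned.
clf = LogisticRegression(max_iter=1000).fit(H_tr, y_tr)
print(clf.score(H_te, y_te))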
> 2. Testing: no mention of how you will test any of this work. It's hard
> to know when an ML algorithm is implemented well. How will you know?
> Usually reproducing published results is a good bar to aim for, which ones
> do you have in mind? E.g. if there are some results in your PhD thesis that
> you want to reproduce, then mention that. How long does it take to train
> such things, do you need access to big computers?
>
>
> That's the main motivation of using Extreme Learning Machines; they take
> seconds to train ;). The only obstacle is memory, because they process the
> matrices all at once; however, this is where Sequential ELMs come in :).
>
> I will add another section explaining the evaluation of the algorithms. It
> would include solving systems of linear equations by hand and comparing the
> results with the algorithm's output; how does that sound? Obviously, this is
> besides testing for coding issues like checking whether the control flow
> works as intended.
>
> A bit cheesy, but I intend to cross-check the algorithms' outputs with
> those of the MATLAB versions of the implementations, and Theano's
> implementation of deep networks. :)
>
>
Sounds good, but I wouldn't be so confident they always take seconds to
train. I think some deep vision system models are pretty much just big
convolutional ELMs (e.g.
http://jmlr.org/proceedings/papers/v28/bergstra13.pdf) and they can take up
to, say, an hour of GPU time to (a) compute all of the features for a big
data set and (b) train the linear output model. Depending on your data set
you might want to use more than 150 output neurons! When I was doing those
experiments, it seemed that models got better and better the more outputs I
used; they just take longer to train and eventually don't fit in memory.
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general