+1 for simple learning for simple cases.

Where the normal equations have a reasonable condition number, using them is
a good choice.

However, for large sparse systems, SGD with Adagrad will crush direct
solutions, even for linear problems.
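For reference, a per-coordinate Adagrad update looks roughly like the sketch
below (Scala, hypothetical names, not Flink's current implementation):
accumulate the squared gradient per coordinate and divide the step size by the
square root of that sum, so coordinates with large or frequent gradients get
smaller steps.

    // Rough sketch of a per-coordinate Adagrad update; not the actual Flink ML code.
    object AdagradSketch {
      def update(weights: Array[Double],
                 gradient: Array[Double],
                 gradSquaredSum: Array[Double],  // running sum of squared gradients
                 initialStepsize: Double = 0.1,
                 epsilon: Double = 1e-8): Unit = {
        var i = 0
        while (i < weights.length) {
          gradSquaredSum(i) += gradient(i) * gradient(i)
          weights(i) -= initialStepsize * gradient(i) / (math.sqrt(gradSquaredSum(i)) + epsilon)
          i += 1
        }
      }
    }

For sparse data only the non-zero gradient coordinates need to be touched,
which is what makes this attractive for large sparse systems.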



On Thu, Jun 4, 2015 at 2:38 PM, Mikio Braun <mikiobr...@googlemail.com>
wrote:

> It's true that we can and should look into methods to make SGD more
> resilient. However, especially for linear regression, which even has a
> closed-form solution, all this seems excessive.
>
> I mean, in the end, if the number of features is small (let's say less
> than 2000), the best way is to compute the covariance matrix and then
> just solve the problem. Even for larger problems, we could use something
> like conjugate gradients to just compute the result. All of this will
> be much faster and have no additional parameters to tune.
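> For the small-feature case that boils down to something like the following
> sketch (Breeze, assuming the Gram matrix fits on one machine; not the actual
> Flink ML code):
>
>     import breeze.linalg.{DenseMatrix, DenseVector}
>
>     // Solve the normal equations (X^T X) w = X^T y directly.
>     def solveNormalEquations(x: DenseMatrix[Double], y: DenseVector[Double]): DenseVector[Double] = {
>       val gram = x.t * x // d x d covariance/Gram matrix
>       val rhs = x.t * y  // d-dimensional right-hand side
>       gram \ rhs         // dense solve; fine for d up to a few thousand
>     }
>
> No step size, no iteration count, nothing to tune.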
>
> On Thu, Jun 4, 2015 at 1:26 PM, Till Rohrmann <trohrm...@apache.org>
> wrote:
> > At the moment, the SGD implementation works like this (modulo
> > regularization):
> >
> >   newWeights = oldWeights - adaptedStepsize * sumOfGradients / numberOfGradients
> >
> > where adaptedStepsize = initialStepsize / sqrt(iterationNumber) and
> > sumOfGradients is the simple sum of the gradients for all points in the
> > batch.
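> > Spelled out with Breeze vectors, a single iteration is roughly the
> > following (simplified sketch, regularization left out):
> >
> >     import breeze.linalg.DenseVector
> >
> >     def sgdStep(oldWeights: DenseVector[Double],
> >                 gradients: Seq[DenseVector[Double]],
> >                 initialStepsize: Double,
> >                 iterationNumber: Int): DenseVector[Double] = {
> >       // step size decays with the square root of the iteration number
> >       val adaptedStepsize = initialStepsize / math.sqrt(iterationNumber)
> >       // average the gradients over the batch and take one step
> >       val avgGradient = gradients.reduce(_ + _) / gradients.length.toDouble
> >       oldWeights - (avgGradient * adaptedStepsize)
> >     }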
> >
> > Thanks for the pointer, Ted. These methods look really promising. We
> > definitely have to update our SGD implementation to use a better adaptive
> > learning rate strategy. I’ll open a JIRA for that.
> >
> > Maybe the default learning rate of 0.1 is also set too high.
> >
> >
> > On Thu, Jun 4, 2015 at 1:20 AM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> >
> >> Any form of generalized linear regression should use adaptive learning
> >> rates rather than simple SGD. One of the current best methods is Adagrad,
> >> although there are variants such as RMSprop and Adadelta. All are pretty
> >> easy to implement.
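> >> For comparison, RMSprop just replaces Adagrad's running sum of squared
> >> gradients with an exponentially decaying average (rough sketch in Scala,
> >> hypothetical names):
> >>
> >>     // per-coordinate RMSprop step
> >>     def rmspropStep(w: Array[Double], grad: Array[Double], cache: Array[Double],
> >>                     stepsize: Double = 0.001, decay: Double = 0.9, eps: Double = 1e-8): Unit = {
> >>       var i = 0
> >>       while (i < w.length) {
> >>         cache(i) = decay * cache(i) + (1 - decay) * grad(i) * grad(i)
> >>         w(i) -= stepsize * grad(i) / (math.sqrt(cache(i)) + eps)
> >>         i += 1
> >>       }
> >>     }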
> >>
> >> Here is some visualization of various methods that provides some
> insights:
> >> http://imgur.com/a/Hqolp
> >>
> >> Vowpal Wabbit has some tricks that allow very large initial learning rates
> >> to be used without divergence. I don't know the details.
> >>
> >> On Wed, Jun 3, 2015 at 8:05 PM, Mikio Braun <mikiobr...@googlemail.com>
> >> wrote:
> >>
> >> > We should probably look into this nevertheless. Requiring a full grid
> >> > search for a simple algorithm like MLR sounds like overkill.
> >> >
> >> > Have you written down the math of your implementation somewhere?
> >> >
> >> > -M
> >> >
> >> > ----- Original Message -----
> >> > From: "Till Rohrmann" <till.rohrm...@gmail.com>
> >> > Sent: 02.06.2015 11:31
> >> > To: "dev@flink.apache.org" <dev@flink.apache.org>
> >> > Subject: Re: MultipleLinearRegression - Strange results
> >> >
> >> > Great to hear. This should no longer be a pain point once we support
> >> proper
> >> > cross validation.
> >> >
> >> > On Tue, Jun 2, 2015 at 11:11 AM, Felix Neutatz <neut...@googlemail.com>
> >> > wrote:
> >> >
> >> > > Yes, grid search solved the problem :)
> >> > >
> >> > > 2015-06-02 11:07 GMT+02:00 Till Rohrmann <till.rohrm...@gmail.com>:
> >> > >
> >> > > > The SGD algorithm adapts the learning rate accordingly. However, this
> >> > > > does not help if you choose the initial learning rate too large, because
> >> > > > then you calculate a weight vector in the first iterations from which it
> >> > > > takes a really long time to recover.
> >> > > >
> >> > > > Cheers,
> >> > > > Till
> >> > > >
> >> > > > On Mon, Jun 1, 2015 at 7:15 PM, Sachin Goel <sachingoel0...@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > > > You can set the learning rate to be 1/sqrt(iteration number). This
> >> > > > > usually works.
> >> > > > >
> >> > > > > Regards
> >> > > > > Sachin Goel
> >> > > > >
> >> > > > > On Mon, Jun 1, 2015 at 9:09 PM, Alexander Alexandrov <
> >> > > > > alexander.s.alexand...@gmail.com> wrote:
> >> > > > >
> >> > > > > > I've seen some work on adaptive learning rates in the past few days.
> >> > > > > >
> >> > > > > > Maybe we can think about extending the base algorithm and comparing
> >> > > > > > the use case setting for the IMPRO-3 project.
> >> > > > > >
> >> > > > > > @Felix, you can discuss this with the others on Wednesday. Manu will
> >> > > > > > also be there and can give some feedback. I'll try to send a link
> >> > > > > > tomorrow morning...
> >> > > > > >
> >> > > > > >
> >> > > > > > 2015-06-01 20:33 GMT+10:00 Till Rohrmann <trohrm...@apache.org>:
> >> > > > > >
> >> > > > > > > Since MLR uses stochastic gradient descent, you probably have to
> >> > > > > > > configure the step size right. SGD is very sensitive to the right
> >> > > > > > > step size choice. If the step size is too high, then the SGD
> >> > > > > > > algorithm does not converge. You can find the parameter description
> >> > > > > > > here [1].
> >> > > > > > >
> >> > > > > > > Cheers,
> >> > > > > > > Till
> >> > > > > > >
> >> > > > > > > [1]
> >> > > > > > > http://ci.apache.org/projects/flink/flink-docs-master/libs/ml/multiple_linear_regression.html
> >> > > > > > >
> >> > > > > > > On Mon, Jun 1, 2015 at 11:48 AM, Felix Neutatz <neut...@googlemail.com>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hi,
> >> > > > > > > >
> >> > > > > > > > I want to use MultipleLinearRegression, but I got really strange
> >> > > > > > > > results. So I tested it with the housing price dataset:
> >> > > > > > > >
> >> > > > > > > > http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data
> >> > > > > > > >
> >> > > > > > > > And here I get negative house prices - even when I use the
> >> > > > > > > > training set as dataset:
> >> > > > > > > > LabeledVector(-1.1901998613214253E78, DenseVector(1500.0, 2197.0, 2978.0, 1369.0, 1451.0))
> >> > > > > > > > LabeledVector(-2.7411218018254747E78, DenseVector(4445.0, 4522.0, 4038.0, 4223.0, 4868.0))
> >> > > > > > > > LabeledVector(-2.688526857613956E78, DenseVector(4522.0, 4038.0, 4351.0, 4129.0, 4617.0))
> >> > > > > > > > LabeledVector(-1.3075960386971714E78, DenseVector(2001.0, 2059.0, 1992.0, 2008.0, 2504.0))
> >> > > > > > > > LabeledVector(-1.476238770814297E78, DenseVector(1992.0, 1965.0, 1983.0, 2300.0, 3811.0))
> >> > > > > > > > LabeledVector(-1.4298128754759792E78, DenseVector(2059.0, 1992.0, 1965.0, 2425.0, 3178.0))
> >> > > > > > > > ...
> >> > > > > > > >
> >> > > > > > > > and a huge squared error:
> >> > > > > > > > Squared error: 4.799184832395361E159
> >> > > > > > > >
> >> > > > > > > > You can find my code here:
> >> > > > > > > >
> >> > > > > > > > https://github.com/FelixNeutatz/wikiTrends/blob/master/extraction/src/test/io/sanfran/wikiTrends/extraction/flink/Regression.scala
> >> > > > > > > >
> >> > > > > > > > Can you help me? What did I do wrong?
> >> > > > > > > >
> >> > > > > > > > Thank you for your help,
> >> > > > > > > > Felix
> >> > > > > > > >
>
> --
> Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
>
