+1 for simple learning for simple cases. Where the normal equations have a reasonable condition number, using them is a good choice.
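(To make the "just solve it" option concrete, here is a minimal sketch of ordinary least squares via the normal equations, written against Breeze purely for illustration. It is not the FlinkML implementation, and the function name is made up.)

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Ordinary least squares via the normal equations: w = (X^T X)^-1 (X^T y).
    // Reasonable when the number of features is small and X^T X is well conditioned.
    def solveNormalEquations(x: DenseMatrix[Double], y: DenseVector[Double]): DenseVector[Double] = {
      val gram = x.t * x   // d x d Gram ("covariance") matrix
      val rhs  = x.t * y   // d-dimensional right-hand side
      gram \ rhs           // direct solve, no step size to tune
    }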
For large sparse systems, however, SGD with Adagrad will crush direct solutions, even for linear problems.

On Thu, Jun 4, 2015 at 2:38 PM, Mikio Braun <mikiobr...@googlemail.com> wrote:
> It's true that we can and should look into methods to make SGD more resilient. However, especially for linear regression, which even has a closed-form solution, all of this seems excessive.
>
> I mean, in the end, if the number of features is small (let's say less than 2000), the best way is to compute the covariance matrix and then just solve the problem. Even for larger problems, we could use something like conjugate gradients to just compute the result. All of this will be much faster and have no additional parameters to tune.
>
> On Thu, Jun 4, 2015 at 1:26 PM, Till Rohrmann <trohrm...@apache.org> wrote:
> > At the moment the current SGD implementation works like this (modulo regularization): newWeights = oldWeights - adaptedStepsize * sumOfGradients/numberOfGradients, where adaptedStepsize = initialStepsize/sqrt(iterationNumber) and sumOfGradients is the simple sum of the gradients for all points in the batch.
> >
> > Thanks for the pointer, Ted. These methods look really promising. We definitely have to update our SGD implementation to use a better adaptive learning rate strategy. I'll open a JIRA for that.
> >
> > Maybe also the default learning rate of 0.1 is set too high.
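(The update rule Till describes above maps to roughly the following. This is only an illustrative sketch, not the actual FlinkML code; the names and signature are made up.)

    // Mini-batch SGD step with a 1/sqrt(t) decayed step size, as described above.
    def sgdStep(
        oldWeights: Array[Double],
        batchGradients: Seq[Array[Double]],  // one gradient per point in the batch
        initialStepsize: Double,
        iterationNumber: Int): Array[Double] = {
      val adaptedStepsize = initialStepsize / math.sqrt(iterationNumber.toDouble)
      val sumOfGradients = batchGradients.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      oldWeights.indices.map { i =>
        oldWeights(i) - adaptedStepsize * sumOfGradients(i) / batchGradients.size
      }.toArray
    }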
> > On Thu, Jun 4, 2015 at 1:20 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > Any form of generalized linear regression should use adaptive learning rates rather than simple SGD. One of the current best methods is Adagrad, although there are variants such as RMSProp and Adadelta. All are pretty easy to implement.
> > >
> > > Here is some visualization of various methods that provides some insights:
> > > http://imgur.com/a/Hqolp
> > >
> > > Vowpal Wabbit has some tricks that allow very large initial learning rates to be used without divergence. I don't know the details.
> > >
> > > On Wed, Jun 3, 2015 at 8:05 PM, Mikio Braun <mikiobr...@googlemail.com> wrote:
> > > > We should probably look into this nevertheless. Requiring full grid search for a simple algorithm like MLR sounds like overkill.
> > > >
> > > > Have you written down the math of your implementation somewhere?
> > > >
> > > > -M
> > > >
> > > > ----- Original message -----
> > > > From: "Till Rohrmann" <till.rohrm...@gmail.com>
> > > > Sent: 02.06.2015 11:31
> > > > To: "dev@flink.apache.org" <dev@flink.apache.org>
> > > > Subject: Re: MultipleLinearRegression - Strange results
> > > >
> > > > Great to hear. This should no longer be a pain point once we support proper cross validation.
> > > >
> > > > On Tue, Jun 2, 2015 at 11:11 AM, Felix Neutatz <neut...@googlemail.com> wrote:
> > > > > Yes, grid search solved the problem :)
> > > > >
> > > > > 2015-06-02 11:07 GMT+02:00 Till Rohrmann <till.rohrm...@gmail.com>:
> > > > > > The SGD algorithm adapts the learning rate accordingly. However, this does not help if you choose the initial learning rate too large, because then you calculate a weight vector in the first iterations from which it takes a really long time to recover.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Mon, Jun 1, 2015 at 7:15 PM, Sachin Goel <sachingoel0...@gmail.com> wrote:
> > > > > > > You can set the learning rate to be 1/sqrt(iteration number). This usually works.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Sachin Goel
> > > > > > >
> > > > > > > On Mon, Jun 1, 2015 at 9:09 PM, Alexander Alexandrov <alexander.s.alexand...@gmail.com> wrote:
> > > > > > > > I've seen some work on adaptive learning rates in the past days.
> > > > > > > >
> > > > > > > > Maybe we can think about extending the base algorithm and comparing it in the use case setting of the IMPRO-3 project.
> > > > > > > >
> > > > > > > > @Felix: you can discuss this with the others on Wednesday. Manu will also be there and can give some feedback. I'll try to send a link tomorrow morning...
> > > > > > > >
> > > > > > > > 2015-06-01 20:33 GMT+10:00 Till Rohrmann <trohrm...@apache.org>:
> > > > > > > > > Since MLR uses stochastic gradient descent, you probably have to configure the step size right. SGD is very sensitive to the right step size choice. If the step size is too high, then the SGD algorithm does not converge. You can find the parameter description here [1].
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Till
> > > > > > > > >
> > > > > > > > > [1] http://ci.apache.org/projects/flink/flink-docs-master/libs/ml/multiple_linear_regression.html
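(As a usage note: the step size and iteration count are set as parameters on the predictor. The following sketch assumes the parameter setters described in [1]; the tiny data set is made up, so verify the exact API against the linked page.)

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.common.LabeledVector
    import org.apache.flink.ml.math.DenseVector
    import org.apache.flink.ml.regression.MultipleLinearRegression

    val env = ExecutionEnvironment.getExecutionEnvironment

    // Toy training data, made up for illustration only.
    val training = env.fromElements(
      LabeledVector(1.0, DenseVector(1.0, 2.0)),
      LabeledVector(2.0, DenseVector(2.0, 4.0)))

    val mlr = MultipleLinearRegression()
      .setIterations(100)
      .setStepsize(0.001)   // a conservative initial learning rate

    mlr.fit(training)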
> > > > > > > > > On Mon, Jun 1, 2015 at 11:48 AM, Felix Neutatz <neut...@googlemail.com> wrote:
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I want to use MultipleLinearRegression, but I got really strange results. So I tested it with the housing price dataset:
> > > > > > > > > > http://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data
> > > > > > > > > >
> > > > > > > > > > And here I get negative house prices - even when I use the training set as the dataset:
> > > > > > > > > > LabeledVector(-1.1901998613214253E78, DenseVector(1500.0, 2197.0, 2978.0, 1369.0, 1451.0))
> > > > > > > > > > LabeledVector(-2.7411218018254747E78, DenseVector(4445.0, 4522.0, 4038.0, 4223.0, 4868.0))
> > > > > > > > > > LabeledVector(-2.688526857613956E78, DenseVector(4522.0, 4038.0, 4351.0, 4129.0, 4617.0))
> > > > > > > > > > LabeledVector(-1.3075960386971714E78, DenseVector(2001.0, 2059.0, 1992.0, 2008.0, 2504.0))
> > > > > > > > > > LabeledVector(-1.476238770814297E78, DenseVector(1992.0, 1965.0, 1983.0, 2300.0, 3811.0))
> > > > > > > > > > LabeledVector(-1.4298128754759792E78, DenseVector(2059.0, 1992.0, 1965.0, 2425.0, 3178.0))
> > > > > > > > > > ...
> > > > > > > > > >
> > > > > > > > > > and a huge squared error:
> > > > > > > > > > Squared error: 4.799184832395361E159
> > > > > > > > > >
> > > > > > > > > > You can find my code here:
> > > > > > > > > > https://github.com/FelixNeutatz/wikiTrends/blob/master/extraction/src/test/io/sanfran/wikiTrends/extraction/flink/Regression.scala
> > > > > > > > > >
> > > > > > > > > > Can you help me? What did I do wrong?
> > > > > > > > > >
> > > > > > > > > > Thank you for your help,
> > > > > > > > > > Felix
>
> --
> Mikio Braun - http://blog.mikiobraun.de, http://twitter.com/mikiobraun
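(To make the Adagrad suggestion from Ted's message above concrete: a per-coordinate Adagrad step looks roughly like the following. This is a hedged sketch for illustration, not FlinkML code, and the names are made up.)

    // Per-coordinate Adagrad update: each weight gets its own effective step size
    // stepsize / (sqrt(sum of squared past gradients) + eps), so coordinates with
    // large accumulated gradients slow down automatically.
    final case class AdagradState(weights: Array[Double], sumSqGradients: Array[Double])

    def adagradStep(state: AdagradState, gradient: Array[Double],
                    stepsize: Double, eps: Double = 1e-8): AdagradState = {
      val newSumSq = state.sumSqGradients.zip(gradient).map { case (s, g) => s + g * g }
      val newWeights = state.weights.indices.map { i =>
        state.weights(i) - stepsize / (math.sqrt(newSumSq(i)) + eps) * gradient(i)
      }.toArray
      AdagradState(newWeights, newSumSq)
    }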