[ https://issues.apache.org/jira/browse/SPARK-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001136#comment-14001136 ]

Xiangrui Meng commented on SPARK-1859:
--------------------------------------

The step size should be smaller than 2/L, where L is the Lipschitz constant of 
the gradient. Your example contains the term 0.5 * (1500 * w - 2400)^2, whose 
Hessian (and hence L) is 1500 * 1500. To make it converge, you need to set the 
step size smaller than 1.0 / (1500 * 1500). Yes, it looks like a simple 
problem, but it is actually ill-conditioned.
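A minimal plain-Python sketch (not MLlib's implementation; the `descend` helper is hypothetical) shows the effect of the step size on this single term of the objective:

```python
# Plain gradient descent on f(w) = 0.5 * (1500*w - 2400)^2.
# The gradient is 1500 * (1500*w - 2400), so its Lipschitz constant is
# L = 1500**2 = 2.25e6; convergence requires a step size below 2/L.

def descend(step, iters=100):
    w = 0.0
    for _ in range(iters):
        grad = 1500.0 * (1500.0 * w - 2400.0)
        w -= step * grad
    return w

print(descend(1.0))             # step >> 2/L: diverges (blows up to inf/nan)
print(descend(1.0 / 1500**2))   # step = 1/L: converges to 2400/1500 = 1.6
```

With step = 1/L the iteration lands on the minimizer in a single step for this quadratic; with any step above 2/L the error grows by a factor |1 - step*L| > 1 each iteration, which is exactly the divergence reported below.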

scikit-learn may use a line search or solve the least-squares problem directly, 
while LinearRegressionWithSGD does not implement a line search. You can try 
L-BFGS in the current master, which should work for your example.
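Preprocessing also helps here: rescaling the feature to unit magnitude makes the problem well-conditioned, so even a fixed, moderate step size converges. A minimal plain-Python sketch on the reported data (the `scale` variable and the hand-rolled loop are illustrative, not MLlib code):

```python
# (label, feature) pairs from the report; the true model is y = 1.6 * x.
data = [(2400.0, 1500.0), (240.0, 150.0), (24.0, 15.0),
        (2.4, 1.5), (0.24, 0.15)]

# Rescale the feature to unit magnitude before fitting.
scale = max(abs(x) for _, x in data)          # 1500.0
scaled = [(y, x / scale) for y, x in data]

# Fixed-step gradient descent on the mean squared error, no line search.
w, step = 0.0, 0.5
for _ in range(2000):
    grad = sum((w * x - y) * x for y, x in scaled) / len(scaled)
    w -= step * grad

coeff = w / scale   # undo the scaling to recover the original coefficient
print(coeff)        # ~= 1.6
```

After scaling, the curvature is sum(x^2)/n ≈ 0.2 instead of ~450000, so a step size of 0.5 contracts the error every iteration instead of exploding.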

> Linear, Ridge and Lasso Regressions with SGD yield unexpected results
> ---------------------------------------------------------------------
>
>                 Key: SPARK-1859
>                 URL: https://issues.apache.org/jira/browse/SPARK-1859
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 0.9.1
>         Environment: OS: Ubuntu Server 12.04 x64
> PySpark
>            Reporter: Vlad Frolov
>              Labels: algorithm, machine_learning, regression
>
> Issue:
> Linear Regression with SGD doesn't work as expected on any data except 
> lpsa.data (the bundled example).
> Ridge Regression with SGD *sometimes* works ok.
> Lasso Regression with SGD *sometimes* works ok.
> Code example (PySpark) based on 
> http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
> {code:title=regression_example.py}
> from numpy import array
> from pyspark.mllib.regression import LinearRegressionWithSGD
> 
> # sc is the SparkContext provided by the pyspark shell.
> # Each record is [label, feature], i.e. f(X) = 1.6 * X.
> parsedData = sc.parallelize([
>     array([2400., 1500.]),
>     array([240., 150.]),
>     array([24., 15.]),
>     array([2.4, 1.5]),
>     array([0.24, 0.15])
> ])
> # Build the model
> model = LinearRegressionWithSGD.train(parsedData)
> print model._coeffs
> {code}
> So we have a line ({{f(X) = 1.6 * X}}) here. Fortunately, {{f(X) = X}} works! :)
> The resulting model has NaN coeffs: {{array([ nan])}}.
> Furthermore, if you comment out the records one by one, you will get:
> * [-1.55897475e+296] coeff (the first record is commented out),
> * [-8.62115396e+104] coeff (the first two records are commented out),
> * etc.
> It looks like the implemented regression algorithms diverge somehow.
> I get almost the same results on Ridge and Lasso.
> I've also tested these inputs in scikit-learn, and it works as expected 
> there.
> However, I'm still not sure whether this is a bug or an SGD 'feature'. 
> Should I preprocess my datasets somehow?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
