[ 
https://issues.apache.org/jira/browse/SPARK-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002572#comment-14002572
 ] 

Vlad Frolov commented on SPARK-1859:
------------------------------------

Thank you for your detailed reply! I'm going to try LBFGS today.

However, I've tried your suggestion to use a step size < (1.0/1500/1500), and the 
results are not as bad as before. Here is the coefficient for different step sizes:
(1.0 / (1500 ** 2)): coeff = 1.58746901
(1.0 / (1200 ** 2)): coeff = 1.5987283
(1.0 / (1100 ** 2)): coeff = 1.599634
(1.0 / (1000 ** 2)): coeff = 1.59993178
(1.0 / (750 ** 2)): coeff = 1.59999966
(1.0 / (...... ** 2)): coeff = 1.59999968
(1.0 / (200 ** 2)): coeff = 1.59999968
(1.0 / (188 ** 2)): coeff = 1.59966697
(1.0 / (187 ** 2)): coeff = 1.59571803
(1.0 / (186 ** 2)): coeff = 1.54666248
(1.0 / (185 ** 2)): coeff = 0.95557116

Here are some conclusions:
1) (1.0 / (1500 ** 2)) is not that bad a guess, but denominators in the 200 .. 750 
range work better.
2) I've tried step sizes smaller than (1.0 / (1500 ** 2)), but they give even 
worse results.
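
The threshold around 1/185² makes sense if the suggested bound comes from the curvature of the squared loss: batch gradient descent is stable only while stepSize stays below roughly 2 / (mean of x²), and the 1500 feature dominates that mean. A minimal plain-Python sketch (not MLlib itself, but using the same 1/sqrt(t) step decay) reproduces the behaviour on the data from this issue:

```python
# Plain-Python sketch (not MLlib): batch gradient descent on the
# y = 1.6 * x data from the report, with an MLlib-style 1/sqrt(t) decay.
data = [(1500.0, 2400.0), (150.0, 240.0), (15.0, 24.0),
        (1.5, 2.4), (0.15, 0.24)]  # (feature, label) pairs

def fit(step0, iters=100):
    w, n = 0.0, len(data)
    for t in range(1, iters + 1):
        # gradient of the mean squared error at the current w
        grad = sum((w * x - y) * x for x, y in data) / n
        w -= (step0 / t ** 0.5) * grad
    return w

print(fit(1.0 / 750 ** 2))  # converges to ~1.6
print(fit(1.0 / 100 ** 2))  # blows up: step exceeds the stability bound
```

Here the stability bound is 2 / (sum(x * x for x, _ in data) / len(data)) ≈ 4.4e-6, so it is the squared magnitude of the largest feature, not the number of records, that dictates the usable step size.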

> Linear, Ridge and Lasso Regressions with SGD yield unexpected results
> ---------------------------------------------------------------------
>
>                 Key: SPARK-1859
>                 URL: https://issues.apache.org/jira/browse/SPARK-1859
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 0.9.1
>         Environment: OS: Ubuntu Server 12.04 x64
> PySpark
>            Reporter: Vlad Frolov
>              Labels: algorithm, machine_learning, regression
>
> Issue:
> Linear Regression with SGD doesn't work as expected on any data except 
> lpsa.dat (the bundled example).
> Ridge Regression with SGD *sometimes* works ok.
> Lasso Regression with SGD *sometimes* works ok.
> Code example (PySpark) based on 
> http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
> {code:title=regression_example.py}
> from numpy import array
> from pyspark.mllib.regression import LinearRegressionWithSGD
>
> # Each record is [label, feature], with label = 1.6 * feature
> parsedData = sc.parallelize([
>     array([2400., 1500.]),
>     array([240., 150.]),
>     array([24., 15.]),
>     array([2.4, 1.5]),
>     array([0.24, 0.15])
> ])
> # Build the model
> model = LinearRegressionWithSGD.train(parsedData)
> print model._coeffs
> {code}
> So the data lies on a line ({{f(X) = 1.6 * X}}), yet the resulting model has 
> NaN coeffs: {{array([ nan])}}. Fortunately, {{f(X) = X}} works! :)
> Furthermore, if you comment records line by line you will get:
> * [-1.55897475e+296] coeff (the first record is commented), 
> * [-8.62115396e+104] coeff (the first two records are commented),
> * etc
> It looks like the implemented regression algorithms diverge somehow.
> I get almost the same results on Ridge and Lasso.
> I've also tested these inputs in scikit-learn and it works as expected there.
> However, I'm still not sure whether it's a bug or SGD 'feature'. Should I 
> preprocess my datasets somehow?
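
On the preprocessing question above: rescaling the records so the largest feature magnitude is about 1 lets plain gradient descent converge with the default stepSize = 1.0. A hypothetical plain-Python sketch (not MLlib; dividing both columns by the same constant preserves the slope 1.6):

```python
# Hypothetical illustration (plain Python, not MLlib): feature scaling
# restores convergence at the default stepSize = 1.0.
raw = [(1500.0, 2400.0), (150.0, 240.0), (15.0, 24.0),
       (1.5, 2.4), (0.15, 0.24)]  # (feature, label) pairs
x_max = max(abs(x) for x, _ in raw)                 # 1500.0
scaled = [(x / x_max, y / x_max) for x, y in raw]   # slope 1.6 preserved

def fit(data, step0, iters):
    """Batch gradient descent on squared loss with 1/sqrt(t) step decay."""
    w, n = 0.0, len(data)
    for t in range(1, iters + 1):
        grad = sum((w * x - y) * x for x, y in data) / n
        w -= (step0 / t ** 0.5) * grad
    return w

print(fit(scaled, 1.0, 500))  # ~1.6, no tiny hand-tuned step needed
```

After scaling, the mean of x² drops to about 0.2, so a step size of 1.0 sits comfortably inside the stability region; this is why scikit-learn-style pipelines standardize features before SGD.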



--
This message was sent by Atlassian JIRA
(v6.2#6252)
