Vlad Frolov created SPARK-1859:
----------------------------------

             Summary: Linear, Ridge and Lasso Regressions with SGD yield 
unexpected results
                 Key: SPARK-1859
                 URL: https://issues.apache.org/jira/browse/SPARK-1859
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 0.9.1
         Environment: OS: Ubuntu Server 12.04 x64
PySpark
            Reporter: Vlad Frolov


Issue:
Linear Regression with SGD doesn't work as expected on any data except lpsa.dat
(the bundled example).
Ridge Regression with SGD *sometimes* works ok.
Lasso Regression with SGD *sometimes* works ok.

Code example (PySpark) based on 
http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
{code:title=regression_example.py}
from numpy import array
from pyspark.mllib.regression import LinearRegressionWithSGD

# Each record is [label, feature]; every row satisfies label = 1.6 * feature
parsedData = sc.parallelize([
    array([2400., 1500.]),
    array([240., 150.]),
    array([24., 15.]),
    array([2.4, 1.5]),
    array([0.24, 0.15])
])

# Build the model
model = LinearRegressionWithSGD.train(parsedData)
print model._coeffs
{code}

So we have a line ({{f(X) = 1.6 * X}}) here. Fortunately, {{f(X) = X}} works! :)
The resulting model has NaN coeffs: {{array([ nan])}}.
Furthermore, if you comment out the records one by one, you get:
* a coeff of [-1.55897475e+296] (with the first record commented out),
* a coeff of [-8.62115396e+104] (with the first two records commented out),
* etc.

It looks like the implemented regression algorithms diverge somehow.

I get almost the same results with Ridge and Lasso.

I've also tested these inputs in scikit-learn, and it works as expected there.

However, I'm still not sure whether this is a bug or an SGD 'feature'. Should I
preprocess my datasets somehow?
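For what it's worth, the blow-up pattern is consistent with plain gradient descent diverging because a fixed step size (1.0 by default for these trainers, I believe) is too large for features of this magnitude, while rescaling the feature column makes the same update rule converge. A minimal sketch in plain numpy (not Spark; only the data and the step-size assumption mirror the example above):

```python
import numpy as np

# Data from the example above: label = 1.6 * feature
y = np.array([2400., 240., 24., 2.4, 0.24])
x = np.array([1500., 150., 15., 1.5, 0.15])

def gd(x, y, step=1.0, iters=100):
    """Least-squares gradient descent with a fixed step size."""
    w = 0.0
    for _ in range(iters):
        grad = np.mean((w * x - y) * x)  # d/dw of 0.5 * mean((w*x - y)**2)
        w -= step * grad
    return w

w_raw = gd(x, y, iters=10)     # |w| grows by ~4.5e5x per iteration: divergence
w_scaled = gd(x / x.max(), y)  # converges to 2400 (= 1.6 * 1500 in scaled units)

print(w_raw, w_scaled)
```

If that's indeed the cause, the workaround would be scaling the features (or lowering the step size) before training, rather than a fix in the algorithm itself.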



--
This message was sent by Atlassian JIRA
(v6.2#6252)
