Vlad Frolov created SPARK-1859:
----------------------------------

Summary: Linear, Ridge and Lasso Regressions with SGD yield unexpected results
Key: SPARK-1859
URL: https://issues.apache.org/jira/browse/SPARK-1859
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 0.9.1
Environment: OS: Ubuntu Server 12.04 x64
PySpark
Reporter: Vlad Frolov

Issue:
Linear Regression with SGD does not work as expected on any data except lpsa.dat (the example data set).
Ridge Regression with SGD *sometimes* works OK.
Lasso Regression with SGD *sometimes* works OK.

Code example (PySpark) based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :

{code:title=regression_example.py}
from numpy import array
from pyspark.mllib.regression import LinearRegressionWithSGD

# `sc` is the SparkContext provided by the PySpark shell.
# Each record is [label, feature].
parsedData = sc.parallelize([
    array([2400., 1500.]),
    array([240., 150.]),
    array([24., 15.]),
    array([2.4, 1.5]),
    array([0.24, 0.15])
])

# Build the model
model = LinearRegressionWithSGD.train(parsedData)

print model._coeffs
{code}

So the data lies on a line ({{f(X) = 1.6 * X}}). Fortunately, {{f(X) = X}} works! :)

The resulting model has NaN coefficients: {{array([ nan])}}.

Furthermore, if you comment out the records one at a time, you get:
* [-1.55897475e+296] coeff (the first record is commented out),
* [-8.62115396e+104] coeff (the first two records are commented out),
* etc.

It looks like the implemented regression algorithms diverge somehow.

I get almost the same results with Ridge and Lasso.

I've also tested these inputs in scikit-learn, and they work as expected there.

However, I'm still not sure whether it's a bug or an SGD 'feature'. Should I preprocess my datasets somehow?
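
The divergence described above can be reproduced outside Spark with a few lines of plain Python. The following is only a rough sketch, not MLlib's actual optimizer: the function name {{fit_slope}}, the full-batch gradient, the {{step_size / sqrt(t)}} decay, and the default step size and iteration count are all assumptions made for illustration. It suggests that with features as large as 1500 and a step size of 1.0 the updates overshoot on every iteration, while dividing the features by their largest absolute value lets the same step size settle near 1.6.

```python
import math

# Records from the report as (label, feature) pairs; they lie on y = 1.6 * x.
DATA = [(2400.0, 1500.0), (240.0, 150.0), (24.0, 15.0), (2.4, 1.5), (0.24, 0.15)]


def fit_slope(data, step_size=1.0, iterations=100):
    """Full-batch gradient descent on squared error for y ~ w * x.

    Hypothetical helper: the step_size / sqrt(t) decay and the defaults are
    assumptions chosen for illustration, not a copy of MLlib's optimizer.
    """
    w = 0.0
    for t in range(1, iterations + 1):
        # Gradient of (1 / 2n) * sum((w * x - y) ** 2) with respect to w.
        grad = sum((w * x - y) * x for y, x in data) / len(data)
        w -= step_size / math.sqrt(t) * grad
    return w


# Raw features: mean(x^2) is about 4.5e5, so every update overshoots and the
# error grows by several orders of magnitude per step until w is non-finite.
w_raw = fit_slope(DATA)

# Features divided by the largest |x|: mean(x^2) drops below 1, so the same
# step size converges. The learned weight is mapped back to the raw scale.
x_max = max(abs(x) for _, x in DATA)
scaled = [(y, x / x_max) for y, x in DATA]
w_scaled = fit_slope(scaled, iterations=1000) / x_max

print("raw features:    w = %r" % w_raw)
print("scaled features: w = %r" % w_scaled)
```

In this sketch a much smaller step size on the raw data would also avoid the blow-up; rescaling is shown because it leaves the step size untouched. Whether the same remedy applies to the MLlib implementation is not confirmed here.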