GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/14326
[SPARK-3181] [ML] Implement RobustRegression with huber loss.

## What changes were proposed in this pull request?

The current implementation is a straightforward port of Python scikit-learn's ```HuberRegressor```, so it produces the same results. The code is meant for discussion, so please overlook trivial issues for now, since I think we may have slightly different ideas for our Spark implementation. Here are the major issues to discuss:

* Objective function. We use Eq.(6) in [A robust hybrid of lasso and ridge regression](http://statweb.stanford.edu/~owen/reports/hhu.pdf) as the objective function.
  ![image](https://cloud.githubusercontent.com/assets/1962026/17076521/02a3f054-5069-11e6-895d-3c904e056ba2.png)
  But the convention differs from other Spark ML code such as ```LinearRegression``` in two aspects:
  * The loss is the total loss rather than the mean loss. ```LinearRegression``` uses ```lossSum/weightSum``` as the mean loss.
  * We do not multiply the loss function and the L2 regularization by 1/2. This is not a problem, since multiplying the whole formula by a constant factor does not affect the result.

  So should we switch to the modified objective function below, which would be consistent with other Spark ML code?
  ![image](https://cloud.githubusercontent.com/assets/1962026/17076522/14eceb4e-5069-11e6-84ae-ecfaf3ea12ed.png)
* Implement a new class ```RobustRegression``` or a new loss function for ```LinearRegression```. Both ```LinearRegression``` and ```RobustRegression``` accomplish the same goal, but the output of ```fit``` will be different: ```LinearRegressionModel``` and ```RobustRegressionModel```. The former contains only ```coefficients``` and ```intercept```, but the latter contains ```coefficients```, ```intercept```, and ```scale/sigma``` (and possibly the outlier samples, similar to sklearn's ```HuberRegressor.outliers_```). Merging the two models into one would also raise a save/load compatibility issue.
One workaround is to drop ```scale/sigma``` so that ```fit``` with the Huber cost function still outputs a ```LinearRegressionModel```, but I don't think that is appropriate since it would lose some model attributes. So I implemented ```RobustRegression``` as a new class, and we can port this loss function to ```LinearRegression``` later if needed.

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-3181

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14326.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14326

----
commit 8fd0ca1954f964e89cf81379fdaff0844afd7253
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-07-23T06:54:58Z

    Implement RobustRegression with huber loss.

----
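
For readers following the objective-function question in the description, here is a minimal Python sketch of the two conventions being compared. It is not the code in this patch. It assumes the objective that scikit-learn documents for ```HuberRegressor``` (which the description says this PR ports), and it assumes the "modified" form means mean loss with a factor of 1/2 on both the loss and the L2 term; the images holding the actual formulas are not reproduced here. The function names and sample data are hypothetical, and ```m=1.35``` mirrors scikit-learn's default ```epsilon```.

```python
def huber(z, m):
    """H_M(z): quadratic for |z| < M, linear beyond that (scikit-learn's convention)."""
    az = abs(z)
    return z * z if az < m else 2.0 * m * az - m * m


def l2_squared(w):
    """Squared L2 norm of the coefficient vector."""
    return sum(wj * wj for wj in w)


def total_objective(w, b, sigma, xs, ys, m=1.35, lam=0.0):
    """Total-loss form: sum_i (sigma + H_M(r_i / sigma) * sigma) + lam * ||w||^2."""
    loss = 0.0
    for x, y in zip(xs, ys):
        residual = y - (sum(wj * xj for wj, xj in zip(w, x)) + b)
        loss += sigma + huber(residual / sigma, m) * sigma
    return loss + lam * l2_squared(w)


def mean_half_objective(w, b, sigma, xs, ys, m=1.35, lam=0.0):
    """Assumed LinearRegression-style form: (1/2) * mean loss + (lam / 2) * ||w||^2."""
    n = len(xs)
    mean_loss = total_objective(w, b, sigma, xs, ys, m, 0.0) / n
    return 0.5 * mean_loss + 0.5 * lam * l2_squared(w)


# Hypothetical data: three well-behaved points and one gross outlier.
xs = [[1.0], [2.0], [3.0], [4.0]]
ys = [1.1, 1.9, 3.2, 40.0]
w, b, sigma, lam = [1.0], 0.0, 1.0, 0.8
n = len(xs)

total = total_objective(w, b, sigma, xs, ys, lam=lam)
# Dividing the regularization strength by n makes the two forms proportional.
rescaled = mean_half_objective(w, b, sigma, xs, ys, lam=lam / n)
```

Under these assumptions the factor of 1/2 alone only rescales the objective, so the minimizer is unchanged. Switching from total to mean loss is a pure rescaling only if the regularization strength is divided by the sample count as well, which is worth keeping in mind when comparing a given ```regParam``` against scikit-learn's ```alpha```.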