GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/14326

    [SPARK-3181] [ML] Implement RobustRegression with huber loss.

    ## What changes were proposed in this pull request?
    The current implementation is a straightforward port of the Python 
scikit-learn ```HuberRegressor```, so it produces the same results.
    The code is meant for discussion, so please overlook trivial issues for now, 
since I think we may have slightly different ideas for our Spark implementation.
    
    Here are the major issues that should be discussed:
    * Objective function.
    
    We use Eq.(6) in [A robust hybrid of lasso and ridge 
regression](http://statweb.stanford.edu/~owen/reports/hhu.pdf) as the objective 
function.
    
![image](https://cloud.githubusercontent.com/assets/1962026/17076521/02a3f054-5069-11e6-895d-3c904e056ba2.png)
    But this convention differs from other Spark ML code such as 
```LinearRegression``` in two aspects:
    • The loss is the total loss rather than the mean loss. ```LinearRegression``` 
uses ```lossSum/weightSum``` as the mean loss.
    • We do not multiply the loss function and the L2 regularization by 1/2. This 
is not a problem, since multiplying the whole formula by a constant factor does 
not affect the result.
    So should we switch to a modified objective function like the following, 
which would be consistent with other Spark ML code?
    
![image](https://cloud.githubusercontent.com/assets/1962026/17076522/14eceb4e-5069-11e6-84ae-ecfaf3ea12ed.png)
    * Implement a new class ```RobustRegression``` or add a new loss function to 
```LinearRegression```.
    
    Both ```LinearRegression``` and ```RobustRegression``` accomplish the same 
goal, but the output of ```fit``` differs: 
```LinearRegressionModel``` versus ```RobustRegressionModel```. The former only 
contains ```coefficients``` and ```intercept```, while the latter contains 
```coefficients```, ```intercept```, and ```scale/sigma``` (and possibly the outlier 
samples, similar to sklearn's ```HuberRegressor.outliers_```). Combining the two 
models into one would also raise a save/load compatibility issue. 
One workaround is to drop ```scale/sigma``` so that ```fit``` with 
the huber cost function still outputs a ```LinearRegressionModel```, but I don't 
think that is appropriate, since it would lose some model attributes. So I 
implemented ```RobustRegression``` as a new class, and we can port this loss 
function to ```LinearRegression``` later if needed. 
    
    ## How was this patch tested?
    Unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-3181

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14326.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14326
    
----
commit 8fd0ca1954f964e89cf81379fdaff0844afd7253
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-07-23T06:54:58Z

    Implement RobustRegression with huber loss.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
