zhengruifeng commented on pull request #31693: URL: https://github.com/apache/spark/pull/31693#issuecomment-790263832
according to the comments [here](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L33-L43):

```
 * For improving the convergence rate during the optimization process and also to prevent against
 * features with very large variances exerting an overly large influence during model training,
 * packages like R's GLMNET perform the scaling to unit variance and remove the mean in order to
 * reduce the condition number. The model is then trained in this scaled space, but returns the
 * coefficients in the original scale. See page 9 in
 * http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
 *
 * However, we don't want to apply the [[org.apache.spark.ml.feature.StandardScaler]] on the
 * training dataset, and then cache the standardized dataset since it will create a lot of overhead.
 * As a result, we perform the scaling implicitly when we compute the objective function (though
 * we do not subtract the mean).
```

I think the proposed change to also center the vectors should be theoretically reasonable.
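To make the "implicit scaling" idea quoted above concrete, here is a minimal, self-contained sketch (not Spark's actual `LogisticAggregator` code; the names `gradientScaled` and `featuresStd` are hypothetical) showing how features can be divided by their standard deviation on the fly inside the gradient computation, instead of materializing and caching a standardized dataset:

```scala
// Hypothetical sketch of implicit feature scaling in a logistic-loss gradient.
// Features are divided by their std dev inside the computation, so no
// standardized copy of the data ever needs to be cached.
object ImplicitScalingSketch {

  def gradientScaled(
      coefficients: Array[Double],  // coefficients in the scaled space
      features: Array[Double],      // raw (unscaled) features
      label: Double,                // 0.0 or 1.0
      featuresStd: Array[Double]    // per-feature standard deviations
  ): Array[Double] = {
    // margin = sum_j beta_j * (x_j / std_j); features with zero variance are skipped
    var margin = 0.0
    var j = 0
    while (j < features.length) {
      if (featuresStd(j) != 0.0) margin += coefficients(j) * features(j) / featuresStd(j)
      j += 1
    }
    val prob = 1.0 / (1.0 + math.exp(-margin))
    // Gradient of the logistic loss w.r.t. the coefficients in the scaled space.
    // Centering would slot in here as well, by replacing x_k / std_k with
    // (x_k - mean_k) / std_k, which is what this PR discussion is about.
    Array.tabulate(features.length) { k =>
      if (featuresStd(k) != 0.0) (prob - label) * features(k) / featuresStd(k) else 0.0
    }
  }

  def main(args: Array[String]): Unit = {
    val grad = gradientScaled(
      coefficients = Array(0.1, -0.2),
      features = Array(3.0, 40.0),
      label = 1.0,
      featuresStd = Array(1.5, 20.0))
    println(grad.mkString(", "))
  }
}
```

The trained coefficients live in the scaled space and would be converted back to the original scale afterwards, as the quoted scaladoc describes.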
