zhengruifeng commented on pull request #31693: URL: https://github.com/apache/spark/pull/31693#issuecomment-790263832
according to the comments [here](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L33-L43):

```
 * For improving the convergence rate during the optimization process and also to prevent against
 * features with very large variances exerting an overly large influence during model training,
 * packages like R's GLMNET perform the scaling to unit variance and remove the mean in order to
 * reduce the condition number. The model is then trained in this scaled space, but returns the
 * coefficients in the original scale. See page 9 in
 * http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
 *
 * However, we don't want to apply the [[org.apache.spark.ml.feature.StandardScaler]] on the
 * training dataset, and then cache the standardized dataset since it will create a lot of overhead.
 * As a result, we perform the scaling implicitly when we compute the objective function (though
 * we do not subtract the mean).
```

I think the proposed change to also center the vectors should be theoretically reasonable.
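To make the "implicit scaling" idea quoted above concrete, here is a minimal, self-contained sketch (not Spark's actual `LogisticAggregator` code; the names `gradientScaled` and `featuresStd` are hypothetical) showing how features can be divided by their standard deviation on the fly inside the gradient computation, instead of materializing and caching a standardized dataset:

```scala
// Hypothetical sketch of implicit feature scaling in a logistic-loss gradient.
// Features are divided by their std dev inside the computation, so no
// standardized copy of the data ever needs to be cached.
object ImplicitScalingSketch {

  def gradientScaled(
      coefficients: Array[Double],  // coefficients in the scaled space
      features: Array[Double],      // raw (unscaled) features
      label: Double,                // 0.0 or 1.0
      featuresStd: Array[Double]    // per-feature standard deviations
  ): Array[Double] = {
    // margin = sum_j beta_j * (x_j / std_j); features with zero variance are skipped
    var margin = 0.0
    var j = 0
    while (j < features.length) {
      if (featuresStd(j) != 0.0) margin += coefficients(j) * features(j) / featuresStd(j)
      j += 1
    }
    val prob = 1.0 / (1.0 + math.exp(-margin))
    // Gradient of the logistic loss w.r.t. the coefficients in the scaled space.
    // Centering would slot in here as well, by replacing x_k / std_k with
    // (x_k - mean_k) / std_k, which is what this PR discussion is about.
    Array.tabulate(features.length) { k =>
      if (featuresStd(k) != 0.0) (prob - label) * features(k) / featuresStd(k) else 0.0
    }
  }

  def main(args: Array[String]): Unit = {
    val grad = gradientScaled(
      coefficients = Array(0.1, -0.2),
      features = Array(3.0, 40.0),
      label = 1.0,
      featuresStd = Array(1.5, 20.0))
    println(grad.mkString(", "))
  }
}
```

The trained coefficients live in the scaled space and would be converted back to the original scale afterwards, as the quoted scaladoc describes.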
