mengxr commented on pull request #31693: URL: https://github.com/apache/spark/pull/31693#issuecomment-795017955
This is my understanding of the behavior. Because we didn't center the columns, a near-constant column takes very large values after std scaling. As a result, the coefficient corresponding to that column is very small and largely insensitive to regularization, if any is applied. The algorithm used to rely on regularization to push the extra weight into the intercept, but that is now ineffective, so the weight can shift freely between the intercept and the feature weight.

This PR does the "virtual" centering, and we should calculate the initial intercept after centering. Those are good changes. However, I'm not sure it is necessary to backport them to old releases, because the old approach might still produce a "correct" model in the sense of making similar predictions, although the coefficients might converge slowly. I asked @zhengruifeng to test it. If it is not a correctness bug, we can save the effort of backporting the change.
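To make the point about weight shifting concrete, here is a rough sketch in plain Python (not Spark code). The data, weights, and variable names are all made up for illustration; it only assumes a single near-constant feature that is scaled by its std without centering.

```python
import statistics

# Hypothetical near-constant feature: values hover around 100 with tiny spread.
x = [100.0, 100.01, 99.99, 100.02, 99.98]

mean = statistics.mean(x)
std = statistics.pstdev(x)

# Std scaling without centering: the values blow up because std is tiny.
scaled = [v / std for v in x]
# "Virtual" centering: same scaling, with the scaled mean subtracted.
centered = [v / std - mean / std for v in x]


def predict(features, weight, intercept):
    """Linear margin for a single-feature model."""
    return [weight * f + intercept for f in features]


# Model A puts everything in the intercept.
w_a, b_a = 0.0, 1.0
# Model B moves a tiny weight onto the scaled feature and compensates
# in the intercept (illustrative numbers, not fitted values).
w_b = 1e-4
b_b = b_a - w_b * (mean / std)

preds_a = predict(scaled, w_a, b_a)
preds_b = predict(scaled, w_b, b_b)

# Largest difference in predictions between the two models.
max_gap = max(abs(p - q) for p, q in zip(preds_a, preds_b))
# Difference in an L2 penalty on the coefficient (intercept not penalized).
l2_penalty_gap = 0.5 * (w_b ** 2 - w_a ** 2)
intercept_shift = abs(b_a - b_b)

print(min(scaled), intercept_shift, max_gap, l2_penalty_gap)
```

In this toy setup the intercept moves substantially while the predictions and the L2 penalty barely change, which is the sense in which regularization cannot pin down the split between the intercept and the feature weight. With the centered feature, the same small weight would leave the intercept untouched.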
