[GitHub] [spark] mengxr commented on pull request #31693: [SPARK-34448][ML] Binary logistic regression incorrectly computes the intercept and coefficients with small var features

GitBox Tue, 09 Mar 2021 23:28:34 -0800


mengxr commented on pull request #31693:
URL: https://github.com/apache/spark/pull/31693#issuecomment-795017955



   This is my understanding of the behavior. Because we don't center the 
columns, a near-constant column takes very large values after scaling by its 
standard deviation. The coefficient for that column is therefore very small 
and barely affected by regularization, if any is applied. The algorithm used 
to rely on regularization to push the extra weight into the intercept, but 
that is now ineffective. So the weight can shift freely between the intercept 
and the feature coefficient.
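   
   A minimal sketch of that effect, in plain Python rather than Spark code. 
The column values, the helper names (`scale_by_std`, 
`move_intercept_into_feature`) and the starting intercept are all assumptions 
made up for illustration. It scales a near-constant column by its standard 
deviation without centering, then moves half of the intercept onto the feature 
coefficient and measures how little the margins and the L2 penalty change.

```python
import statistics

def scale_by_std(column):
    """Divide by the sample standard deviation without centering first."""
    std = statistics.stdev(column)
    return [v / std for v in column]

def move_intercept_into_feature(intercept, scaled_column, fraction):
    """Move a fraction of the intercept onto the scaled feature's coefficient."""
    moved = fraction * intercept
    coef = moved / statistics.mean(scaled_column)
    return intercept - moved, coef

# Hypothetical near-constant column: mean about 1000, spread about 0.002.
column = [1000.0, 1000.001, 999.999, 1000.002, 999.998]
scaled = scale_by_std(column)

intercept = 2.0
new_intercept, coef = move_intercept_into_feature(intercept, scaled, 0.5)

# Margins with all weight on the intercept vs. half of it moved to the feature.
margins_before = [intercept for _ in scaled]
margins_after = [new_intercept + coef * v for v in scaled]
max_margin_change = max(abs(a - b) for a, b in zip(margins_after, margins_before))

# The feature coefficient is tiny, so its L2 penalty contribution is negligible.
l2_penalty = coef ** 2

print(min(scaled), coef, max_margin_change, l2_penalty)
```

   Because the scaled column is almost a constant with a huge value, it acts 
like a second intercept that the penalty hardly touches, which is why the 
optimizer has no strong reason to prefer one split of the weight over another.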
   
   This PR does "virtual" centering, and we should calculate the initial 
intercept after centering. Those are good changes.
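   
   As a rough illustration of the idea, not the code in this PR: one way to 
center "virtually" is to leave the feature values untouched and fold the 
centering into a single constant offset on the margin. The function names and 
the numbers below are hypothetical. The sketch also assumes the initial 
intercept is the log-odds of the labels, which is the best intercept when all 
coefficients are zero and the features are centered.

```python
import math

def explicit_centered_margin(x, coef, mean, std, intercept):
    """Margin computed on features that are actually centered and scaled."""
    return sum(w * (v - m) / s for w, v, m, s in zip(coef, x, mean, std)) + intercept

def virtual_centered_margin(x, coef, mean, std, intercept):
    """Same margin, but the centering is folded into one constant offset."""
    scaled_coef = [w / s for w, s in zip(coef, std)]
    offset = sum(sw * m for sw, m in zip(scaled_coef, mean))
    return sum(sw * v for sw, v in zip(scaled_coef, x)) - offset + intercept

def initial_intercept(labels):
    """Log-odds of the positive class, assuming both classes are present."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return math.log(positives / negatives)

# Hypothetical row with one near-constant feature and one ordinary feature.
x = [1000.001, 3.0]
mean = [1000.0, 2.0]
std = [0.001, 1.5]
coef = [0.5, -0.25]
b0 = initial_intercept([1, 1, 1, 0])

explicit = explicit_centered_margin(x, coef, mean, std, b0)
virtual = virtual_centered_margin(x, coef, mean, std, b0)
print(explicit, virtual, b0)
```

   The two margins agree up to floating-point rounding, and the virtual form 
never has to materialize a centered copy of the data.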
   
   However, I'm not sure it is necessary to backport the changes to old 
releases, because the old approach might still produce a "correct" model in 
the sense of making similar predictions, although the coefficients might 
converge slowly. I asked @zhengruifeng to test it. If it is not a correctness 
bug, we can save the effort of backporting the change.




