[
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291378#comment-17291378
]
zhengruifeng edited comment on SPARK-34448 at 2/26/21, 4:58 AM:
----------------------------------------------------------------
[~srowen] [~weichenxu123] [~ykerzhner]
My findings so far:
1. As to the param {{standardization}}: its name and doc are misleading. No matter
whether it is true (the default) or false, LR always "standardizes" the input
vectors in a special way (x => x / std(x)), but the transformed vectors are not
centered;
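For illustration, a minimal sketch of the two transforms (the helper names are mine, not Spark's):
{code:java}
// Scale-only transform, as LR does internally: x => x / std(x).
// The result has unit variance, but its mean becomes mean/std, generally non-zero.
def scaleOnly(xs: Array[Double], std: Double): Array[Double] =
  xs.map(_ / std)

// Full standardization: x => (x - mean(x)) / std(x).
// The result has zero mean and unit variance.
def standardize(xs: Array[Double], mean: Double, std: Double): Array[Double] =
  xs.map(x => (x - mean) / std){code}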
2. For the Scala test suite above, I logged the internal gradient and model
(intercept & coefficients) at each iteration. I checked the objective function
and gradient, and they seem to be calculated correctly;
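For reference, this is the form of the objective/gradient I checked against the logged values (a standalone sketch using Breeze, not Spark's exact aggregator code):
{code:java}
import breeze.linalg.{DenseVector => BDV}

def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// Binary logistic loss and gradient for a single instance:
// returns (loss, gradient w.r.t. coefficients, gradient w.r.t. intercept).
def lossAndGradient(x: BDV[Double], y: Double,
                    coef: BDV[Double], intercept: Double): (Double, BDV[Double], Double) = {
  val p = sigmoid((coef dot x) + intercept)
  val loss = -y * math.log(p) - (1.0 - y) * math.log(1.0 - p)
  val err = p - y
  (loss, x * err, err)
}{code}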
3. For the case with {{const_feature}} (values 0.9 & 1.0) above, the mean & std
of the three input features are:
{code:java}
featuresMean: [0.4999142959117828,1.4847274177074965,0.9899999976158129]
featuresStd: [0.28501348037270735,0.28375633081273305,0.03000002215257344]{code}
Note that {{const_feature}} (its std is about 0.03) will be scaled to (30.0 & 33.3).
*I suspect that the underlying solvers (OWLQN/LBFGS/LBFGSB) cannot handle a
feature with such large (>30) values.*
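A quick check of those scaled values, using the logged std:
{code:java}
val constStd = 0.03000002215257344   // logged std of const_feature
println(0.9 / constStd)              // ~30.0
println(1.0 / constStd)              // ~33.3{code}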
3.1. Since the std vector affects both the internal scaling and the
regularization, I disabled regularization by setting regParam to 0.0, to see
whether this scaling alone matters.
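Concretely, something like the following (a sketch; {{df}} stands for the training DataFrame from the test suite above):
{code:java}
import org.apache.spark.ml.classification.LogisticRegression

// With regParam = 0.0 the penalty term vanishes, so the std vector can
// only influence the result through the internal x => x / std(x) scaling.
val lr = new LogisticRegression()
  .setRegParam(0.0)
val model = lr.fit(df)
println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}"){code}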
With the *LBFGS* solver the issue also exists; the solution with {{const_feature}} is:
{code:java}
Coefficients: [0.29713531586902586,0.1928976631256973,-0.44332696536594945]
Intercept: -3.548585606117963 {code}
Then I manually set the std vector to all ones:
{code:java}
// Override the computed std vector with all ones (disables the internal scaling).
val featuresStd = Array.fill(featuresMean.length)(1.0){code}
The optimization procedure then behaves as expected, and the solution is:
{code:java}
Coefficients: [0.298868144564205,0.20101389459979044,0.008381706578824933]
Intercept: -4.009204134794202 {code}
3.2. Here I reset regParam (regularization enabled again); with the *OWLQN*
solver, the solution with the all-ones std vector is:
{code:java}
Coefficients: [0.296817926857017,0.19312282148846005,-0.17682584221569103]
Intercept: -3.8124413640824466 {code}
Compared with the previous solution:
{code:java}
Coefficients: [0.2997261304455311,0.18830032771483074,-0.44301560942213103]
Intercept: -3.5428941035683303 {code}
I think the new solution with the unit std vector fits better.
To sum up, I suspect the internal standardization should center the vectors in
some way to match the existing solvers.
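For dense data, centering could be done up front with {{StandardScaler}} (an illustrative sketch only; centering densifies sparse vectors, which is presumably why the internal transform avoids it):
{code:java}
import org.apache.spark.ml.feature.StandardScaler

// withMean = true subtracts the mean (centering); withStd = true divides
// by the std (scaling). Together they give zero-mean, unit-variance features.
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(true)
  .setWithStd(true)
val scaledDf = scaler.fit(df).transform(df){code}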
TODO:
1. Refer to other implementations to see how standardization is done there;
2. Continue with this issue to see what happens if the vectors are centered;
3. This issue may also exist in LiR/SVC/etc.; I will check in the future.
> Binary logistic regression incorrectly computes the intercept and
> coefficients when data is not centered
> --------------------------------------------------------------------------------------------------------
>
> Key: SPARK-34448
> URL: https://issues.apache.org/jira/browse/SPARK-34448
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.4.5, 3.0.0
> Reporter: Yakov Kerzhner
> Priority: Major
> Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the
> bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary
> logistic regression contains a bug that pulls the intercept value towards the
> log(odds) of the target data. This is mathematically only correct when the
> data comes from distributions with zero means. In general, this gives
> incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not so familiar with the Spark code base, I have not been able to
> find this bug within the Spark code itself. A hint to this bug is here:
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> Based on the code, I don't believe that the features have zero means at that
> point, so this heuristic is incorrect. But an incorrect starting point alone
> does not explain this bug; the minimizer should still drift to the correct place.
> I was not able to find the code of the actual objective function that is
> being minimized.