[ https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291378#comment-17291378 ]
zhengruifeng edited comment on SPARK-34448 at 2/26/21, 4:33 AM:
----------------------------------------------------------------

[~srowen] [~weichenxu123] [~ykerzhner] My findings so far:

1, As for the param {{standardization}}, its name and doc are misleading. Whether it is true (the default) or false, LR always "standardizes" the input vectors in a special way (x => x / std(x)), but the transformed vectors are not centered;

2, For the Scala test suite above, I logged the internal gradient and the model (intercept & coefficients) at each iteration. I checked the objective function and the gradient, and they appear to be computed correctly;

3, For the case with const_feature (0.9 & 1.0) above, the mean & std of the three input features are:
{code:java}
featuresMean: [0.4999142959117828,1.4847274177074965,0.9899999976158129]
featuresStd: [0.28501348037270735,0.28375633081273305,0.03000002215257344]{code}
Note that const_feature (its std is 0.03) will be scaled to (30.0 & 33.3). *I suspect that the underlying solvers (OWLQN/LBFGS/LBFGSB) cannot handle a feature with such large (>30) values.*
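To make the scaling concrete, here is a minimal sketch (my own code, using the stats logged above; the helper names {{scaleOnly}} and {{standardize}} are hypothetical) contrasting the scale-only transform LR applies internally with full standardization:
{code:java}
// Feature stats from the run above.
val featuresMean = Array(0.4999142959117828, 1.4847274177074965, 0.9899999976158129)
val featuresStd  = Array(0.28501348037270735, 0.28375633081273305, 0.03000002215257344)

// Scale-only transform (what LR applies internally): x => x / std(x).
def scaleOnly(x: Array[Double]): Array[Double] =
  x.indices.map(j => x(j) / featuresStd(j)).toArray

// Full standardization: x => (x - mean(x)) / std(x).
def standardize(x: Array[Double]): Array[Double] =
  x.indices.map(j => (x(j) - featuresMean(j)) / featuresStd(j)).toArray

// const_feature takes the values 0.9 and 1.0; look at its (third) component:
scaleOnly(Array(0.5, 1.5, 0.9))(2)    // ~30.0 -> large value fed to the solver
scaleOnly(Array(0.5, 1.5, 1.0))(2)    // ~33.3
standardize(Array(0.5, 1.5, 0.9))(2)  // ~-3.0 -> centering keeps the magnitude small
standardize(Array(0.5, 1.5, 1.0))(2)  // ~0.33
{code}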
3.1, Since the std vector affects both the internal scaling and the regularization, I disabled regularization by setting regParam to 0.0 to see whether this scaling alone matters. With the *LBFGS* solver the issue still exists; the solution with const_feature is:
{code:java}
Coefficients: [0.29713531586902586,0.1928976631256973,-0.44332696536594945]
Intercept: -3.548585606117963 {code}
Then I manually set the std vector to all ones:
{code:java}
val featuresStd = Array.fill(featuresMean.length)(1.0){code}
With this change the optimization behaves as expected, and the solution is:
{code:java}
Coefficients: [0.298868144564205,0.20101389459979044,0.008381706578824933]
Intercept: -4.009204134794202 {code}

3.2, Here I reset regParam to 0.5. With the *OWLQN* solver, the solution with the all-ones std vector is:
{code:java}
Coefficients: [0.296817926857017,0.19312282148846005,-0.17682584221569103]
Intercept: -3.8124413640824466 {code}
Compared to the previous solution:
{code:java}
Coefficients: [0.2997261304455311,0.18830032771483074,-0.44301560942213103]
Intercept: -3.5428941035683303 {code}
I think the new solution with the unit std vector fits better.

To summarize, I suspect the internal standardization should also center the vectors in some way to suit the existing solvers.

TODO:
1, I will refer to other implementations to see how standardization is done there;
2, I will continue on this issue to see what happens if the vectors are centered;
3, This issue may also exist in LiR/SVC/etc. I will check in the future.
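For reference, a minimal self-contained sketch of how the regParam = 0.0 check from 3.1 could be reproduced through the public API. The data generation below is hypothetical (shaped to roughly match the means/stds logged above, including a const_feature with mean ~0.99 and std ~0.03); Spark picks the solver internally, and with regParam = 0.0 it takes the LBFGS path:
{code:java}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

import scala.util.Random

val spark = SparkSession.builder().master("local[2]").appName("SPARK-34448").getOrCreate()
import spark.implicits._

val rng = new Random(0)
// Hypothetical data: x1 ~ U(0,1) (mean ~0.5), x2 ~ 1 + U(0,1) (mean ~1.5),
// const_feature = 1.0 with prob 0.9 else 0.9 (mean ~0.99, std ~0.03).
val df = Seq.fill(10000) {
  val x1 = rng.nextDouble()
  val x2 = 1.0 + rng.nextDouble()
  val x3 = if (rng.nextDouble() < 0.9) 1.0 else 0.9
  val logit = -4.0 + 0.3 * x1 + 0.2 * x2  // assumed ground-truth model
  val label = if (rng.nextDouble() < 1.0 / (1.0 + math.exp(-logit))) 1.0 else 0.0
  (label, Vectors.dense(x1, x2, x3))
}.toDF("label", "features")

// regParam = 0.0 removes regularization, isolating the effect of the
// internal x / std(x) scaling.
val model = new LogisticRegression()
  .setRegParam(0.0)
  .setFitIntercept(true)
  .fit(df)
println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")
{code}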
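If the internal standardization were changed to also center the features (TODO item 2 above), the model fit in the standardized space would need to be mapped back to the original scale. Spark already divides the coefficients by the std vector when converting back; with centering, the intercept would need a correction term as well. A sketch of that mapping (my own helper, not Spark code):
{code:java}
// If the model fit on z = (x - mean) / std is g(z) = betaZ . z + bZ, then on the
// raw features: beta_j = betaZ_j / std_j and b = bZ - sum_j betaZ_j * mean_j / std_j.
def destandardize(
    betaZ: Array[Double], bZ: Double,
    mean: Array[Double], std: Array[Double]): (Array[Double], Double) = {
  val beta = betaZ.indices.map(j => betaZ(j) / std(j)).toArray
  val b = bZ - betaZ.indices.map(j => betaZ(j) * mean(j) / std(j)).sum
  (beta, b)
}
{code}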
> Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34448
>                 URL: https://issues.apache.org/jira/browse/SPARK-34448
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.4.5, 3.0.0
>            Reporter: Yakov Kerzhner
>            Priority: Major
>              Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the bug, as well as the output of the code and some commentary: [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary logistic regression contains a bug that pulls the intercept value towards the log(odds) of the target data. This is mathematically correct only when the data comes from distributions with zero means. In general, this gives incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not so familiar with the Spark code base, I have not been able to find this bug within the Spark code itself. A hint to this bug is here: [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> Based on the code, I don't believe that the features have zero means at this point, and so this heuristic is incorrect. But an incorrect starting point does not explain this bug; the minimizer should drift to the correct place. I was not able to find the code of the actual objective function that is being minimized.
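A note on the heuristic referenced in the quoted description: initializing the intercept to the log-odds is exactly optimal in the intercept-only case (all coefficients fixed at zero); the reporter's point is that with nonzero coefficients and non-centered features, the final solution should not be pulled toward that value. A quick numeric check of the intercept-only optimum, with hypothetical label counts:
{code:java}
// With coefficients fixed at zero, L(b) = sum_i [ y_i * b - log(1 + exp(b)) ],
// so dL/db = n1 - n * sigmoid(b), which vanishes at b = log(n1 / n0).
val n1 = 300.0   // hypothetical count of positive labels
val n0 = 9700.0  // hypothetical count of negative labels
val b = math.log(n1 / n0)
val sigmoid = 1.0 / (1.0 + math.exp(-b))
val grad = n1 - (n1 + n0) * sigmoid  // ~0.0 up to floating point
{code}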