[jira] [Commented] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

Yakov Kerzhner (Jira) Fri, 26 Feb 2021 08:52:05 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291757#comment-17291757
 ]


Yakov Kerzhner commented on SPARK-34448:
----------------------------------------

The faster convergence when using standardization that includes centering makes 
sense as you can ahead of time guess the value of the intercept (it should 
equal the log(odds)).   What I still don't understand is how is it that in the 
case that the data is not centered, the intercept after the minimization is 
almost exactly equal to the log(odds).  This seems extremely strange to me and 
I can't find a mathematical reason for this to be happening.  In the original 
test example, could you print out the x, f(x), grad(x) as the minimizer moves 
from (0, 0, 0, log(odds)) to the minimum?

> Binary logistic regression incorrectly computes the intercept and 
> coefficients when data is not centered
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34448
>                 URL: https://issues.apache.org/jira/browse/SPARK-34448
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.4.5, 3.0.0
>            Reporter: Yakov Kerzhner
>            Priority: Major
>              Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the 
> bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary 
> logistic regression contains a bug that pulls the intercept value towards the 
> log(odds) of the target data.  This is mathematically only correct when the 
> data comes from distributions with zero means.  In general, this gives 
> incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not so familiar with the spark code base, I have not been able to 
> find this bug within the spark code itself.  A hint to this bug is here: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> based on the code, I don't believe that the features have zero means at this 
> point, and so this heuristic is incorrect.  But an incorrect starting point 
> does not explain this bug.  The minimizer should drift to the correct place.  
> I was not able to find the code of the actual objective function that is 
> being minimized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

Reply via email to