[ 
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290052#comment-17290052
 ] 

Sean R. Owen commented on SPARK-34448:
--------------------------------------

Yes, I believe you're correct that there's a problem here. [~dbtsai], can I 
add you in here? I think you worked on the LR solver many years ago.

I skimmed the sklearn source code, and it looks like the SAG solver starts 
with a 0 intercept:
https://github.com/scikit-learn/scikit-learn/blob/638b7689bbbfae4bcc4592c6f8a43ce86b571f0b/sklearn/linear_model/tests/test_sag.py#L73

Maybe this is the issue? I can try porting your test case to Scala to see 
if starting from a 0 intercept fixes it. The existing test suites seem to 
pass with a 0 initial intercept, at least.
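
For reference, here is a minimal sketch of the two starting points being compared: a 0 intercept versus the log(odds) of the labels. It is plain Python, not Spark's or sklearn's actual code, and the function names are hypothetical. It only illustrates why log(odds) is a natural heuristic: when all coefficients are held at zero, the gradient of the log-loss with respect to the intercept vanishes there. It says nothing about where the intercept should end up once the coefficients are fitted to uncentered features.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds_intercept(labels):
    """Log-odds of the positive class, log(n1 / n0), for 0/1 labels.

    Hypothetical helper for illustration; assumes both classes are present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return math.log(n_pos / n_neg)


def intercept_gradient(labels, intercept):
    """Gradient of the summed log-loss w.r.t. the intercept,
    assuming every coefficient is zero (intercept-only model)."""
    return sum(sigmoid(intercept) - y for y in labels)


labels = [1, 1, 1, 0]
b0 = log_odds_intercept(labels)

# Gradient at the log(odds) start versus at a 0 start.
print(b0, intercept_gradient(labels, b0), intercept_gradient(labels, 0.0))
```

With zero coefficients the log(odds) start is already stationary, while the 0 start is not; the open question in this issue is what happens after that, when the features are not centered.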

> Binary logistic regression incorrectly computes the intercept and 
> coefficients when data is not centered
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34448
>                 URL: https://issues.apache.org/jira/browse/SPARK-34448
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.4.5, 3.0.0
>            Reporter: Yakov Kerzhner
>            Priority: Major
>              Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the 
> bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary 
> logistic regression contains a bug that pulls the intercept value towards the 
> log(odds) of the target data.  This is mathematically only correct when the 
> data comes from distributions with zero means.  In general, this gives 
> incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not very familiar with the Spark code base, I have not been able to 
> find this bug within the Spark code itself.  A hint to this bug is here: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> Based on the code, I don't believe that the features have zero means at this 
> point, and so this heuristic is incorrect.  But an incorrect starting point 
> does not explain this bug: the minimizer should still drift to the correct place.  
> I was not able to find the code of the actual objective function that is 
> being minimized.



