[
https://issues.apache.org/jira/browse/SPARK-18701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15719212#comment-15719212
]
Apache Spark commented on SPARK-18701:
--------------------------------------
User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16131
> Poisson GLM fails due to wrong initialization
> ---------------------------------------------
>
> Key: SPARK-18701
> URL: https://issues.apache.org/jira/browse/SPARK-18701
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Affects Versions: 2.0.2
> Reporter: Wayne Zhang
> Priority: Critical
> Fix For: 2.2.0
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> Poisson GLM fails for many standard data sets. The issue is incorrect
> initialization leading to almost zero probability and weights. The following
> simple example reproduces the error.
> {code:borderStyle=solid}
> val datasetPoissonLogWithZero = Seq(
> LabeledPoint(0.0, Vectors.dense(18, 1.0)),
> LabeledPoint(1.0, Vectors.dense(12, 0.0)),
> LabeledPoint(0.0, Vectors.dense(15, 0.0)),
> LabeledPoint(0.0, Vectors.dense(13, 2.0)),
> LabeledPoint(0.0, Vectors.dense(15, 1.0)),
> LabeledPoint(1.0, Vectors.dense(16, 1.0)),
> LabeledPoint(0.0, Vectors.dense(10, 0.0)),
> LabeledPoint(0.0, Vectors.dense(15, 0.0)),
> LabeledPoint(0.0, Vectors.dense(12, 2.0)),
> LabeledPoint(0.0, Vectors.dense(13, 0.0)),
> LabeledPoint(1.0, Vectors.dense(15, 0.0)),
> LabeledPoint(1.0, Vectors.dense(15, 0.0)),
> LabeledPoint(0.0, Vectors.dense(15, 0.0)),
> LabeledPoint(0.0, Vectors.dense(12, 2.0)),
> LabeledPoint(1.0, Vectors.dense(12, 2.0))
> ).toDF()
>
> val glr = new GeneralizedLinearRegression()
> .setFamily("poisson")
> .setLink("log")
> .setMaxIter(20)
> .setRegParam(0)
> val model = glr.fit(datasetPoissonLogWithZero)
> {code}
> The issue is in the initialization: the mean is initialized as the response,
> which could be zero. Applying the log link results in very negative numbers
> (protected against -Inf), which again leads to close to zero probability and
> weights in the weighted least squares. The fix is easy: just add a small
> constant, highlighted in red below.
>
> override def initialize(y: Double, weight: Double): Double = {
> require(y >= 0.0, "The response variable of Poisson family " +
> s"should be non-negative, but got $y")
> y {color:red}+ 0.1 {color}
> }
> I already have a fix and test code. Will create a PR.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]