Wayne Zhang created SPARK-18701:
-----------------------------------
Summary: Poisson GLM fails due to wrong initialization
Key: SPARK-18701
URL: https://issues.apache.org/jira/browse/SPARK-18701
Project: Spark
Issue Type: New Feature
Components: ML
Affects Versions: 2.0.2
Reporter: Wayne Zhang
Priority: Critical
Fix For: 2.2.0
Poisson GLM fails for many standard data sets. The issue is incorrect
initialization leading to almost zero probability and weights. The following
simple example reproduces the error.
{code:borderStyle=solid}
val datasetPoissonLogWithZero = Seq(
LabeledPoint(0.0, Vectors.dense(18, 1.0)),
LabeledPoint(1.0, Vectors.dense(12, 0.0)),
LabeledPoint(0.0, Vectors.dense(15, 0.0)),
LabeledPoint(0.0, Vectors.dense(13, 2.0)),
LabeledPoint(0.0, Vectors.dense(15, 1.0)),
LabeledPoint(1.0, Vectors.dense(16, 1.0)),
LabeledPoint(0.0, Vectors.dense(10, 0.0)),
LabeledPoint(0.0, Vectors.dense(15, 0.0)),
LabeledPoint(0.0, Vectors.dense(12, 2.0)),
LabeledPoint(0.0, Vectors.dense(13, 0.0)),
LabeledPoint(1.0, Vectors.dense(15, 0.0)),
LabeledPoint(1.0, Vectors.dense(15, 0.0)),
LabeledPoint(0.0, Vectors.dense(15, 0.0)),
LabeledPoint(0.0, Vectors.dense(12, 2.0)),
LabeledPoint(1.0, Vectors.dense(12, 2.0))
).toDF()
val glr = new GeneralizedLinearRegression()
.setFamily("poisson")
.setLink("log")
.setMaxIter(20)
.setRegParam(0)
val model = glr.fit(datasetPoissonLogWithZero)
{code}
The issue is in the initialization: the mean is initialized as the response,
which could be zero. Applying the log link results in very negative numbers
(protected against -Inf), which again leads to close to zero probability and
weights in the weighted least squares. The fix is easy: just add a small
constant, highlighted in red below.
override def initialize(y: Double, weight: Double): Double = {
require(y >= 0.0, "The response variable of Poisson family " +
s"should be non-negative, but got $y")
y {color:red}+ 0.1 {color}
}
I already have a fix and test code. Will create a PR.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]