Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16149#discussion_r90968163
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -479,7 +479,12 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
             numInstances: Double,
             weightSum: Double): Double = {
           -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
    -        weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
    +        val wt = math.round(weight).toInt
    +        if (wt == 0) {
    +          0.0
    +        } else {
    +          dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
    --- End diff --
    
    I think I understand the problem: a weighted instance is treated like _n_
    copies of one instance, each of whose likelihood is that of a Bernoulli
    trial. (This could have been much simpler in the existing code, right? Just
    log(mu) or log(1-mu), depending on y.)
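
    To make that parenthetical concrete, here is a rough sketch of the simpler
    form I have in mind. `bernoulliLogLik` is a hypothetical helper of my own,
    not something in Spark or Breeze, and it assumes 0 < mu < 1:

    ```scala
    // Sketch only: a hypothetical helper, not Spark code.
    object BernoulliSketch {
      // Weighted log-likelihood of one instance whose label is rounded to 0 or 1.
      // For a single trial, the Binomial(1, mu) log-mass is log(mu) at 1
      // and log(1 - mu) at 0.
      def bernoulliLogLik(y: Double, mu: Double, weight: Double): Double = {
        val label = math.round(y).toInt
        weight * (if (label == 1) math.log(mu) else math.log(1.0 - mu))
      }
    }
    ```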
    
    This loses some information because y is rounded, so a weight=100 instance
    with y=0.5 is treated like 100 instances of y=1, when it arguably should be
    treated like 50 instances each of y=0 and y=1.
    
    This matters not just for non-integer weights, right? You would also get a
    different answer in the case above.
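
    A minimal sketch comparing the two code paths on that case. Everything here
    is my own stand-in: `logChoose` and `binomialLogPmf` replace the
    `dist.Binomial(...).logProbabilityOf(...)` call so the snippet needs no
    external library, and they assume 0 < p < 1 and 0 <= k <= n:

    ```scala
    // Sketch only: hypothetical helpers, not the Spark or Breeze implementation.
    object WeightedBinomialSketch {
      // log of the binomial coefficient C(n, k), computed as a sum of logs.
      def logChoose(n: Int, k: Int): Double =
        (1 to k).map(i => math.log((n - k + i).toDouble) - math.log(i.toDouble)).sum

      // log P(X = k) for X ~ Binomial(n, p).
      def binomialLogPmf(n: Int, p: Double, k: Int): Double =
        logChoose(n, k) + k * math.log(p) + (n - k) * math.log(1.0 - p)

      // Old path: round y to 0 or 1, then scale a single-trial term by the weight.
      def oldLogLik(y: Double, mu: Double, weight: Double): Double =
        weight * binomialLogPmf(1, mu, math.round(y).toInt)

      // Patched path: round the weight to a trial count and y * weight to a
      // success count; a rounded weight of 0 contributes 0.0.
      def newLogLik(y: Double, mu: Double, weight: Double): Double = {
        val wt = math.round(weight).toInt
        if (wt == 0) 0.0 else binomialLogPmf(wt, mu, math.round(y * weight).toInt)
      }
    }
    ```

    With y=0.5, mu=0.9 and weight=100, `oldLogLik` is just 100 * log(0.9),
    while `newLogLik` evaluates the Binomial(100, 0.9) log-mass at 50
    successes, so the two disagree even though the weight is an integer. A
    weight of 2.4 is rounded to 2 trials in `newLogLik`.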
    
    I am out of my depth here, but is this the best generalization? Notably,
    you have to round the weight, which makes this inaccurate for non-integer
    weights.
    
    You have to handle wt=0 separately because dist.Binomial will reject it? 
Binomial with n=0 ought to be well defined. The problem is this says the log 
probability of an instance with weight<0.5 is always 0, but that's not 'true' 
-- I guess it depends on how one defines these weights.
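
    A small sketch of the rounding behind that concern. `trials` is a
    hypothetical name of mine that mirrors the `math.round(weight).toInt` line
    in the patch:

    ```scala
    // Sketch only: illustrates which weights end up in the wt == 0 branch.
    object ZeroWeightSketch {
      // Mirrors the patch: the trial count is the weight rounded to the nearest integer.
      def trials(weight: Double): Int = math.round(weight).toInt

      // Sample weights below 0.5 all map to zero trials, so the patched branch
      // returns a log-likelihood of 0.0 for them regardless of y and mu.
      val zeroTrialWeights: Seq[Double] =
        Seq(0.0, 0.1, 0.49).filter(w => trials(w) == 0)
    }
    ```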
    
    Does this match what R does or something? I'm trying to figure out if this 
is the right thing to do.

