Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14643#discussion_r74774476
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala ---
    @@ -201,11 +201,18 @@ abstract class ProbabilisticClassificationModel[
           probability.argmax
         } else {
           val thresholds: Array[Double] = getThresholds
    -      val scaledProbability: Array[Double] =
    -        probability.toArray.zip(thresholds).map { case (p, t) =>
    -          if (t == 0.0) Double.PositiveInfinity else p / t
    -        }
    -      Vectors.dense(scaledProbability).argmax
    +
    +      if (thresholds.contains(0.0)) {
    +        val indices = thresholds.zipWithIndex.filter(_._1 == 0.0).map(_._2)
    +        val values = indices.map(probability.apply)
    +        Vectors.sparse(numClasses, indices, values).argmax
    +      } else {
    +        val scaledProbability: Array[Double] =
    +          probability.toArray.zip(thresholds).map { case (p, t) =>
    +            if (t == 0.0) Double.PositiveInfinity else p / t
    --- End diff --
    
    Yeah, there are maybe four closely-related issues. 
    
    - what to do when t = 0 for more than one class; it should probably (?) 
be an error to even specify this
    - what about t = 1, especially the case that all are 1?
    - what about the case where nothing exceeds a threshold at all?
    - is the ratio computation the best way to choose among several classes 
that exceed the threshold?
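
To make the first three cases concrete, here is a minimal sketch of the ratio rule on plain arrays. It is not the Spark implementation: `predictByRatio` and `noneExceeds` are hypothetical names, and it assumes ties resolve to the lowest index.

```scala
object RatioRuleEdgeCases {
  // Plain-array stand-in for the ratio rule in the removed lines of the diff:
  // scale each probability by its threshold, then take the first maximal index.
  // Assumption: ties go to the lowest index.
  def predictByRatio(probability: Array[Double], thresholds: Array[Double]): Int = {
    val scaled = probability.zip(thresholds).map { case (p, t) =>
      if (t == 0.0) Double.PositiveInfinity else p / t
    }
    val best = scaled.reduce((a, b) => math.max(a, b))
    scaled.indexOf(best)
  }

  // Hypothetical helper: true when no class reaches its own threshold.
  def noneExceeds(probability: Array[Double], thresholds: Array[Double]): Boolean =
    probability.zip(thresholds).forall { case (p, t) => p < t }
}
```

With thresholds `(0.0, 0.5, 0.0)` and probabilities `(0.1, 0.2, 0.7)`, both zero-threshold classes scale to infinity, so this sketch picks class 0 even though class 2 is far more probable. With all thresholds at 1, the rule reduces to a plain argmax, and it still returns a class when `noneExceeds` is true.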
    
    K-L divergence actually gives a different answer. It would rank p=0.5/t=0.2 
_higher_ than p=0.3/t=0.1, whereas the current ratio rule ranks them the other 
way around (0.5/0.2 = 2.5 versus 0.3/0.1 = 3.0). I think K-L is more 
theoretically sound (though someone may tell me there's an equivalent or 
simpler way to think of it). However, that is the most separable of the four 
questions.
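
As a rough check of that ranking, the sketch below compares the two scores. It assumes "K-L" means the divergence between Bernoulli(p) and Bernoulli(t), which is only one possible reading. `ThresholdScores`, `ratio`, and `bernoulliKL` are hypothetical names, not Spark APIs.

```scala
object ThresholdScores {
  // Current rule from the diff: probability divided by threshold.
  def ratio(p: Double, t: Double): Double = p / t

  // Assumed K-L form: divergence of Bernoulli(p) from Bernoulli(t).
  // Only defined here for 0 < p < 1 and 0 < t < 1.
  def bernoulliKL(p: Double, t: Double): Double =
    p * math.log(p / t) + (1 - p) * math.log((1 - p) / (1 - t))
}
```

Under this assumption, `bernoulliKL(0.5, 0.2)` is roughly 0.223 and `bernoulliKL(0.3, 0.1)` is roughly 0.154, so K-L prefers the first pair, while `ratio` prefers the second.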


