GitHub user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14643#discussion_r75096884
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala ---
    @@ -201,11 +201,18 @@ abstract class ProbabilisticClassificationModel[
           probability.argmax
         } else {
           val thresholds: Array[Double] = getThresholds
    -      val scaledProbability: Array[Double] =
    -        probability.toArray.zip(thresholds).map { case (p, t) =>
    -          if (t == 0.0) Double.PositiveInfinity else p / t
    -        }
    -      Vectors.dense(scaledProbability).argmax
    +
    +      if (thresholds.contains(0.0)) {
    +        val indices = thresholds.zipWithIndex.filter(_._1 == 0.0).map(_._2)
    +        val values = indices.map(probability.apply)
    +        Vectors.sparse(numClasses, indices, values).argmax
    +      } else {
    +        val scaledProbability: Array[Double] =
    +          probability.toArray.zip(thresholds).map { case (p, t) =>
    +            if (t == 0.0) Double.PositiveInfinity else p / t
    --- End diff --
    
    Here's an alternative take on the method that I think addresses all of these:
    
    ```scala
      protected def probability2prediction(probability: Vector): Double = {
        if (!isDefined(thresholds)) {
          return probability.argmax
        }

        // Pair each probability with its threshold, keeping the class index
        // so that filtering below doesn't lose the mapping back to classes.
        val probArray = probability.toArray
        val candidates = probArray.zip(getThresholds).zipWithIndex.filter {
          case ((p, t), _) => p >= t
        }
        if (candidates.isEmpty) {
          return Double.NaN
        }

        // The threshold t notionally defines a Bernoulli distribution, as does p.
        // A measure of the degree to which p "exceeds" t is the distance between
        // these distributions, and this is the Kullback-Leibler divergence:
        // p ln(p/t) + (1-p) ln((1-p)/(1-t))
        // We assume p >= t from above. Pick the p,t whose KL divergence is
        // highest, taking care to deal with p=0/1 and t=0/1 carefully.

        // Computes p ln(p/t) even when p = 0
        def plnpt(p: Double, t: Double): Double =
          if (p == 0.0) 0.0 else p * math.log(p / t)

        val klDivergences = candidates.map {
          case ((p, t), i) if p == t => (0.0, i) // because distributions match
          case ((p, t), i) if t == 0.0 => (Double.PositiveInfinity, i) // because p > 0
          case ((p, t), i) => (plnpt(p, t) + plnpt(1.0 - p, 1.0 - t), i)
        }

        val maxKL = klDivergences.map(_._1).max
        val klAndIndex = klDivergences.filter { case (kl, _) => kl == maxKL }
        if (klAndIndex.length == 1) {
          // One max? return its index
          val (_, maxIndex) = klAndIndex.head
          maxIndex
        } else {
          // Many max? break tie by conditional probability
          val (_, maxIndex) = klAndIndex.maxBy { case (_, i) => probArray(i) }
          maxIndex
        }
      }
    ```
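
    To make the selection rule concrete, here's a tiny standalone check with made-up numbers (the `plnpt`/`kl` helpers just restate the KL formula from the comment; none of this is Spark API):

    ```scala
    // Made-up numbers: classes 0 and 2 meet their thresholds; class 1 (0.5 < 0.6) does not
    def plnpt(p: Double, t: Double): Double = if (p == 0.0) 0.0 else p * math.log(p / t)
    def kl(p: Double, t: Double): Double = plnpt(p, t) + plnpt(1.0 - p, 1.0 - t)

    val probability = Array(0.2, 0.5, 0.3)
    val thresholds = Array(0.1, 0.6, 0.2)
    println(kl(0.2, 0.1)) // ~0.044
    println(kl(0.3, 0.2)) // ~0.028, so class 0 wins despite class 2's higher probability
    ```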
    
    The significant changes are a slight change in behavior, and, more importantly, correct handling of the case where no threshold is exceeded.
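
    For example, with invented numbers:

    ```scala
    // No class meets its threshold here:
    val probability = Array(0.3, 0.3, 0.4)
    val thresholds = Array(0.5, 0.5, 0.5)
    // Current code: argmax of p / t = [0.6, 0.6, 0.8], i.e. class 2
    // This version: no candidates, so the result is Double.NaN
    ```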
    
    Really, this method should return `Option[Int]`, but the API is `Double`, so when no threshold is met the result is `NaN`. Not ideal, but we're kind of stuck with it.
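
    If the signature were free to change, a hypothetical standalone variant (sketched here with the invented name `probability2predictionOption`; not Spark API) could make the no-candidate case explicit:

    ```scala
    // Hypothetical sketch only: the real API must return Double
    def probability2predictionOption(
        probArray: Array[Double],
        thresholds: Array[Double]): Option[Int] = {
      val candidates = probArray.zip(thresholds).zipWithIndex.collect {
        case ((p, t), i) if p >= t => (p, t, i)
      }
      if (candidates.isEmpty) {
        None // instead of NaN
      } else {
        def plnpt(p: Double, t: Double): Double =
          if (p == 0.0) 0.0 else p * math.log(p / t)
        def kl(p: Double, t: Double): Double =
          if (p == t) 0.0
          else if (t == 0.0) Double.PositiveInfinity
          else plnpt(p, t) + plnpt(1.0 - p, 1.0 - t)
        // Highest KL divergence wins; ties broken by higher probability
        Some(candidates.maxBy { case (p, t, _) => (kl(p, t), p) }._3)
      }
    }
    ```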

