Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14643#discussion_r75096884

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala ---

```
@@ -201,11 +201,18 @@ abstract class ProbabilisticClassificationModel[
       probability.argmax
     } else {
       val thresholds: Array[Double] = getThresholds
-      val scaledProbability: Array[Double] =
-        probability.toArray.zip(thresholds).map { case (p, t) =>
-          if (t == 0.0) Double.PositiveInfinity else p / t
-        }
-      Vectors.dense(scaledProbability).argmax
+
+      if (thresholds.contains(0.0)) {
+        val indices = thresholds.zipWithIndex.filter(_._1 == 0.0).map(_._2)
+        val values = indices.map(probability.apply)
+        Vectors.sparse(numClasses, indices, values).argmax
+      } else {
+        val scaledProbability: Array[Double] =
+          probability.toArray.zip(thresholds).map { case (p, t) =>
+            if (t == 0.0) Double.PositiveInfinity else p / t
```

--- End diff --

Here's an alternative take on the method that I think addresses all of these:

```
protected def probability2prediction(probability: Vector): Double = {
  if (!isDefined(thresholds)) {
    return probability.argmax
  }
  val probArray = probability.toArray
  // Keep the original class index with each (p, t) pair, so that filtering out
  // non-candidates doesn't lose track of which class each pair belongs to
  val candidates = probArray.zip(getThresholds).zipWithIndex.collect {
    case ((p, t), i) if p >= t => (p, t, i)
  }
  if (candidates.isEmpty) {
    return Double.NaN
  }
  // The threshold t notionally defines a Bernoulli distribution, as does p.
  // A measure of the degree to which p "exceeds" t is the distance between these
  // distributions, and this is the Kullback-Leibler divergence:
  //   p ln(p/t) + (1-p) ln((1-p)/(1-t))
  // We know p >= t from the filter above. Pick the (p, t) whose KL divergence is
  // highest, taking care to deal with p = 0/1 and t = 0/1 carefully.

  // Computes p ln(p/t) even when p = 0
  def plnpt(p: Double, t: Double): Double =
    if (p == 0.0) 0.0 else p * math.log(p / t)

  val klDivergences = candidates.map {
    case (p, t, _) if p == t => 0.0 // because distributions match
    case (_, t, _) if t == 0.0 => Double.PositiveInfinity // because p > 0
    case (p, t, _) => plnpt(p, t) + plnpt(1.0 - p, 1.0 - t)
  }
  val maxKL = klDivergences.max
  // Of the classes tied at the maximum KL divergence, break the tie by
  // the higher conditional probability
  val tied = candidates.zip(klDivergences).collect {
    case ((p, _, i), kl) if kl == maxKL => (p, i)
  }
  val (_, maxIndex) = tied.maxBy { case (p, _) => p }
  maxIndex
}
```

The significant changes are a slight change in behavior, plus correct handling of the case where no thresholds are exceeded. Really this method should return `Option[Int]`, but the API is `Double`, so the result in that case is `NaN`. Not ideal, but we're kind of stuck with it.
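To make the KL-divergence selection above concrete, here is a minimal standalone sketch that uses plain arrays instead of Spark's `Vector` and returns `Option[Int]` as suggested. The object and method names (`KLSelectionSketch`, `selectByKL`, `bernoulliKL`) are illustrative, not Spark API:

```scala
// Standalone sketch of the KL-based threshold selection described above.
// Not Spark code: names here are hypothetical, for illustration only.
object KLSelectionSketch {

  // p ln(p/t), defined as 0 when p = 0 (the limit of p ln p as p -> 0)
  private def plnpt(p: Double, t: Double): Double =
    if (p == 0.0) 0.0 else p * math.log(p / t)

  // KL divergence between Bernoulli(p) and Bernoulli(t), assuming p >= t
  def bernoulliKL(p: Double, t: Double): Double =
    if (p == t) 0.0                          // identical distributions
    else if (t == 0.0) Double.PositiveInfinity // p > 0 while t = 0
    else plnpt(p, t) + plnpt(1.0 - p, 1.0 - t)

  // Index of the class whose probability most "exceeds" its threshold,
  // or None when no threshold is met
  def selectByKL(probabilities: Array[Double],
                 thresholds: Array[Double]): Option[Int] = {
    // Keep original class indices through the filter
    val candidates = probabilities.zip(thresholds).zipWithIndex.collect {
      case ((p, t), i) if p >= t => (p, t, i)
    }
    if (candidates.isEmpty) None
    else {
      val kls = candidates.map { case (p, t, _) => bernoulliKL(p, t) }
      val maxKL = kls.max
      // Among candidates tied at the maximum KL, prefer the higher probability
      val tied = candidates.zip(kls).collect {
        case ((p, _, i), kl) if kl == maxKL => (p, i)
      }
      Some(tied.maxBy(_._1)._2)
    }
  }

  def main(args: Array[String]): Unit = {
    // KL(0.2 || 0.1) ~ 0.044 exceeds KL(0.5 || 0.4) ~ 0.020, so class 0 wins:
    // a margin near the extremes counts for more than the same margin mid-range
    println(selectByKL(Array(0.2, 0.5, 0.3), Array(0.1, 0.4, 0.5))) // Some(0)
    // No probability meets its threshold
    println(selectByKL(Array(0.2, 0.5, 0.3), Array(0.5, 0.6, 0.5))) // None
    // Both classes have t = 0, so both KLs are infinite; the tie is broken
    // by the higher probability
    println(selectByKL(Array(0.1, 0.9), Array(0.0, 0.0))) // Some(1)
  }
}
```

Note how carrying the class index through the filter keeps the returned index aligned with the original classes rather than with the filtered candidate array.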