GitHub user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3702#discussion_r22183575
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala ---
    @@ -103,7 +117,37 @@ class BinaryClassificationMetrics(scoreAndLabels: RDD[(Double, Double)]) extends
           mergeValue = (c: BinaryLabelCounter, label: Double) => c += label,
          mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
         ).sortByKey(ascending = false)
    -    val agg = counts.values.mapPartitions { iter =>
    +
    +    val binnedCounts =
    +      // Only down-sample if bins is > 0
    +      if (numBins == 0) {
    +        // Use original directly
    +        counts
    +      } else {
    +        val countsSize = counts.count()
    +        // Group the iterator into chunks of about countsSize / numBins points,
    +        // so that the resulting number of bins is about numBins
    +        val grouping = countsSize / numBins
    +        if (grouping < 2) {
    +          // numBins was more than half of the size; no real point in down-sampling to bins
    +          logInfo(s"Curve is too small ($countsSize) for $numBins bins to be useful")
    +          counts
    +        } else if (grouping >= Int.MaxValue) {
    +          logWarning(s"Curve is too large ($countsSize) for $numBins bins; ignoring")
    --- End diff --
    
    I think this should set grouping to Int.MaxValue (and log a warning), since it is exactly these very large datasets that cause problems. The default behavior should avoid failure.

