GitHub user sethah commented on a diff in the pull request:
https://github.com/apache/spark/pull/20472#discussion_r169391525
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
      } else {
        val numSplits = metadata.numSplits(featureIndex)
-        // get count for each distinct value
-        val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+        // get count for each distinct value except zero value
+        val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
          case ((m, cnt), x) =>
            (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
        }
+
+        // Calculate the number of samples for finding splits
+        val numSamples: Int = (samplesFractionForFindSplits(metadata) *
+          metadata.numExamples).toInt
--- End diff ---
The main problem I see with this is that the sampling we do for split
finding is _approximate_: `RDD.sample` without replacement does Bernoulli
sampling, so the size of the sampled RDD only matches `fraction * count` in
expectation. Just as an example: say you have 1000 samples and you take 20%
for split finding. Your actual sampled RDD might have 220 samples in it, 210
of which are non-zero. So `partNumSamples = 210`, `numSamples = 200`, and you
wind up with `numSamples - partNumSamples = -10` zero values. This is not
something we expect to happen often (since we care about the highly sparse
case), but it is something we need to consider. We could just require the
subtraction to be non-negative (and live with a bit of approximation), or you
could call `count` on the sampled RDD, but I don't think that's worth it.
Thoughts?
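For what it's worth, here is a minimal, self-contained sketch of the first
option (plain Scala, with hypothetical counts standing in for the fold over
`featureSamples`; the numbers just replay the 1000-sample example above):

```scala
// Hypothetical per-partition counts standing in for the foldLeft result.
val partValueCountMap = Map(1.0 -> 120, 2.0 -> 90)
val partNumSamples = partValueCountMap.values.sum      // 210 non-zero values actually sampled
val numSamples = (0.2 * 1000).toInt                    // fraction * numExamples = 200 expected

// Clamp so the inferred zero count can never go negative when the
// approximate sampler returns more rows than fraction * numExamples.
val numZeros = math.max(0, numSamples - partNumSamples)   // 0 instead of -10
val valueCountMap =
  if (numZeros > 0) partValueCountMap + (0.0 -> numZeros) else partValueCountMap
```

The tradeoff is that when the sampler overshoots we silently report zero
zeros instead of a negative count, which seems acceptable given that the
split thresholds are approximate anyway.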