[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20472 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r171184500 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging { } else { val numSplits = metadata.numSplits(featureIndex) - // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + // get count for each distinct value except zero value + val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { --- End diff -- have fixed it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r171182634 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging { val numFeatures = metadata.numFeatures val splits: Array[Array[Split]] = Array.tabulate(numFeatures) { case i if metadata.isContinuous(i) => -val split = continuousSplits(i) +// some features may only contains zero, so continuousSplits will not have a record +val split = if (continuousSplits.contains(i)) continuousSplits(i) else Array.empty[Split] --- End diff -- have fixed it --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r171182291 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging { val numFeatures = metadata.numFeatures val splits: Array[Array[Split]] = Array.tabulate(numFeatures) { case i if metadata.isContinuous(i) => -val split = continuousSplits(i) +// some features may only contains zero, so continuousSplits will not have a record --- End diff -- got it --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r171160692 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging { } else { val numSplits = metadata.numSplits(featureIndex) - // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + // get count for each distinct value except zero value + val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1) } + + // Calculate the number of samples for finding splits + val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt --- End diff -- I have seen the note of function _sample_, and _sample_ does not guarantee to provide exactly the fraction of the count of the given RDD. It seems that requiring _numSamples - partNumSamples_ to be non-negative is a more efficient choice than trigger a _count_. The degree of approximation depends upon the degree approximation of _sample_. And it's sure that the splits will be inaccurate. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r169844579 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging { } else { val numSplits = metadata.numSplits(featureIndex) - // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + // get count for each distinct value except zero value + val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1) } + + // Calculate the number of samples for finding splits + val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt --- End diff -- Inaccurate zero counts can cause the change of splits, isn't it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r169391525 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging { } else { val numSplits = metadata.numSplits(featureIndex) - // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + // get count for each distinct value except zero value + val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1) } + + // Calculate the number of samples for finding splits + val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt --- End diff -- The main problem I see with this is that the sampling we do for split finding is _approximate_. Just as an example: say you have 1000 samples, and you take 20% for split finding. Your actual sampled RDD has 220 samples in it, and 210 of those are non-zero. So, `partNumSamples = 210`, `numSamples = 200` and you wind up with `numSamples - partNumSamples = -10` zero values. This is not something you expect to happen often (since we care about the highly sparse case), but something that we need to consider. We could just require the subtraction to be non-negative (and live with a bit of approximation), or you could call `count` on the sampled RDD but I don't think it's worth it. Thoughts? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r169386551 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging { } else { val numSplits = metadata.numSplits(featureIndex) - // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + // get count for each distinct value except zero value + val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1) } + + // Calculate the number of samples for finding splits + val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt + + // add zero value count and get complete statistics + val valueCountMap: Map[Double, Int] = partValueCountMap + (0.0 -> (numSamples - partNumSamples)) --- End diff -- There can be negative values right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r167393891 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging { val numFeatures = metadata.numFeatures val splits: Array[Array[Split]] = Array.tabulate(numFeatures) { case i if metadata.isContinuous(i) => -val split = continuousSplits(i) +// some features may only contains zero, so continuousSplits will not have a record --- End diff -- nit: 'only contains zero` -> `contain only zero`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r166771387 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging { } else { val numSplits = metadata.numSplits(featureIndex) - // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + // get count for each distinct value except zero value + val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1) } + + // Calculate the number of samples for finding splits + val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt + + // add zero value count and get complete statistics + val valueCountMap: Map[Double, Int] = partValueCountMap + (0.0 -> (numSamples - partNumSamples)) --- End diff -- This also probably doesn't matter but won't the new (0.0, ...) element always come first? you could append it below after sorting the rest rather than add it earlier to the Map. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r166770380 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging { val numFeatures = metadata.numFeatures val splits: Array[Array[Split]] = Array.tabulate(numFeatures) { case i if metadata.isContinuous(i) => -val split = continuousSplits(i) +// some features may only contains zero, so continuousSplits will not have a record +val split = if (continuousSplits.contains(i)) continuousSplits(i) else Array.empty[Split] --- End diff -- Just `continuousSplits.getOrElse(i, Array.empty[Split])`? Not that it matters, but also avoids searching the map twice. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r166771033 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging { } else { val numSplits = metadata.numSplits(featureIndex) - // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + // get count for each distinct value except zero value + val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1) } + + // Calculate the number of samples for finding splits + val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt --- End diff -- Just a style nit but you can drop the types on these two new vars. I think we'd consider that implicit. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r166773606 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging { } else { val numSplits = metadata.numSplits(featureIndex) - // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + // get count for each distinct value except zero value + val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { --- End diff -- This bit of (existing) code seems obtuse to me. What about... ``` val partNumSamples = featureSamples.size val partValueCountMap = scala.collection.mutable.Map[Double,Int]() featureSamples.foreach { x => partValueCountMap(x) = partValueCountMap.getOrElse(x, 0) + 1 } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r166501185 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1001,11 +996,19 @@ private[spark] object RandomForest extends Logging { } else { val numSplits = metadata.numSplits(featureIndex) - // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + // get count for each distinct value except zero value + val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1) } + + // Calculate the number of samples for finding splits + val requiredSamples: Long = (samplesFractionForFindSplits(metadata) * metadata.numExamples.toDouble).toLong --- End diff -- Thanks for your comments, and im sorry for my carelessness. I have fixed them all. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r166249547 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1147,4 +1150,20 @@ private[spark] object RandomForest extends Logging { 3 * totalBins } } + + /** +* Calculate the subsample fraction for finding splits --- End diff -- bad indentation, this should fail scalastyle --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r166250001 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1001,11 +996,19 @@ private[spark] object RandomForest extends Logging { } else { val numSplits = metadata.numSplits(featureIndex) - // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + // get count for each distinct value except zero value + val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1) } + + // Calculate the number of samples for finding splits + val requiredSamples: Long = (samplesFractionForFindSplits(metadata) * metadata.numExamples.toDouble).toLong --- End diff -- I think here it could be: ``` val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r165872418 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1001,11 +1002,22 @@ private[spark] object RandomForest extends Logging { } else { val numSplits = metadata.numSplits(featureIndex) - // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + // get count for each distinct value except zero value + val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1) } + + // Calculate the number of samples for finding splits + var requiredSamples: Long = math.max(metadata.maxBins * metadata.maxBins, 1) --- End diff -- I defined a private method _samplesFractionForFindSplits_ in RandomForest.scala --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r165671106 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1001,11 +1002,22 @@ private[spark] object RandomForest extends Logging { } else { val numSplits = metadata.numSplits(featureIndex) - // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + // get count for each distinct value except zero value + val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { case ((m, cnt), x) => (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1) } + + // Calculate the number of samples for finding splits + var requiredSamples: Long = math.max(metadata.maxBins * metadata.maxBins, 1) --- End diff -- This logic is somewhat copied from another method. Can we create new method? Or pass the result through the methods? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user lucio-yz commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r165595281 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1002,10 +1008,14 @@ private[spark] object RandomForest extends Logging { val numSplits = metadata.numSplits(featureIndex) // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + var (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { --- End diff -- problem has been solved --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r165378186 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -931,7 +934,8 @@ private[spark] object RandomForest extends Logging { val numFeatures = metadata.numFeatures val splits: Array[Array[Split]] = Array.tabulate(numFeatures) { case i if metadata.isContinuous(i) => -val split = continuousSplits(i) +// some features may only contains zero, so continuousSplits will not have a record +val split = if ( continuousSplits.contains(i) ) continuousSplits(i) else Array.empty[Split] --- End diff -- Remove space around parens --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r165378109 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1002,10 +1008,14 @@ private[spark] object RandomForest extends Logging { val numSplits = metadata.numSplits(featureIndex) // get count for each distinct value - val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { + var (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) { --- End diff -- I'd probably avoid a `var` by using a new variable for `numSamples` here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r165340988 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -21,7 +21,6 @@ import java.io.IOException import scala.collection.mutable import scala.util.Random - --- End diff -- this empty line must be kept --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r165341639 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -917,11 +916,15 @@ private[spark] object RandomForest extends Logging { // being spun up that will definitely do no work. val numPartitions = math.min(continuousFeatures.length, input.partitions.length) + val numInput = input.count() + val bcNumInput = input.sparkContext.broadcast(numInput) + input .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx --- End diff -- instead of adding the filter method there, here you can avoid the generation of the record itself if `point.features(idx)` is 0.0 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r165341133 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -917,11 +916,15 @@ private[spark] object RandomForest extends Logging { // being spun up that will definitely do no work. val numPartitions = math.min(continuousFeatures.length, input.partitions.length) + val numInput = input.count() + val bcNumInput = input.sparkContext.broadcast(numInput) --- End diff -- this is not needed, you can use directly `numInput` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20472#discussion_r165343004 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -917,11 +916,15 @@ private[spark] object RandomForest extends Logging { // being spun up that will definitely do no work. val numPartitions = math.min(continuousFeatures.length, input.partitions.length) + val numInput = input.count() --- End diff -- we can get this from the `metadata.numExamples * fraction` operation in the calling method in order to avoid another job to perform the count --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...
GitHub user lucio-yz opened a pull request: https://github.com/apache/spark/pull/20472 [SPARK-22751][ML]Improve ML RandomForest shuffle performance ## What changes were proposed in this pull request? As I mentioned in [SPARK-22751](https://issues.apache.org/jira/browse/SPARK-22751?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20ML%20AND%20text%20~%20randomforest), there is a shuffle performance problem in ML Randomforest when train a RF in high dimensional data. The reason is that, in org.apache.spark.tree.impl.RandomForest, the function findSplitsBySorting will actually flatmap a sparse vector into a dense vector, then in groupByKey there will be a huge shuffle write size. To avoid this, we can add a filter after flatmap, to filter out zero value. And in function findSplitsForContinuousFeature, we can infer the number of zero value by pass a parameter numInput to function findSplitsForContinuousFeature. numInput is the number of samples. In addition, if a feature only contains zero value, continuousSplits will not has the key of feature id. So I add a check when using continuousSplits. ## How was this patch tested? Ran model locally using spark-submit. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lucio-yz/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20472.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20472 commit 50cb173dd34dc353c243b97f2686a8c545a03909 Author: lucio <576632108@...> Date: 2018-02-01T09:47:52Z fix mllib randomforest shuffle issue --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org