GitHub user sethah commented on a diff in the pull request:
https://github.com/apache/spark/pull/20472#discussion_r169391525
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
      } else {
        val numSplits = metadata.numSplits(featureIndex)
-        // get count for each distinct value
-        val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
+        // get count for each distinct value except zero value
+        val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
          case ((m, cnt), x) =>
            (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
        }
+
+        // Calculate the number of samples for finding splits
+        val numSamples: Int = (samplesFractionForFindSplits(metadata) *
+          metadata.numExamples).toInt
--- End diff ---
The main problem I see with this is that the sampling we do for split
finding is _approximate_: `RDD.sample` without replacement does Bernoulli
sampling, so the size of the sampled RDD only matches `fraction * count` in
expectation. Just as an example: say you have 1000 samples and you take 20%
for split finding. Your actual sampled RDD might have 220 samples in it, 210
of which are non-zero. So `partNumSamples = 210`, `numSamples = 200`, and you
wind up with `numSamples - partNumSamples = -10` zero values. This is not
something we expect to happen often (since we care about the highly sparse
case), but it is something we need to consider. We could just require the
subtraction to be non-negative (and live with a bit of approximation), or you
could call `count` on the sampled RDD, but I don't think that's worth it.
Thoughts?
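For what it's worth, here is a minimal, self-contained sketch of the first
option (plain Scala, with hypothetical counts standing in for the fold over
`featureSamples`; the numbers just replay the 1000-sample example above):

```scala
// Hypothetical per-partition counts standing in for the foldLeft result.
val partValueCountMap = Map(1.0 -> 120, 2.0 -> 90)
val partNumSamples = partValueCountMap.values.sum      // 210 non-zero values actually sampled
val numSamples = (0.2 * 1000).toInt                    // fraction * numExamples = 200 expected

// Clamp so the inferred zero count can never go negative when the
// approximate sampler returns more rows than fraction * numExamples.
val numZeros = math.max(0, numSamples - partNumSamples)   // 0 instead of -10
val valueCountMap =
  if (numZeros > 0) partValueCountMap + (0.0 -> numZeros) else partValueCountMap
```

The tradeoff is that when the sampler overshoots we silently report zero
zeros instead of a negative count, which seems acceptable given that the
split thresholds are approximate anyway.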