Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20472#discussion_r165341639
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -917,11 +916,15 @@ private[spark] object RandomForest extends Logging {
// being spun up that will definitely do no work.
val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
+ val numInput = input.count()
+ val bcNumInput = input.sparkContext.broadcast(numInput)
+
input
.flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
--- End diff --
Instead of adding the filter method there, you can filter here, inside the
`flatMap`, and avoid generating the record at all when `point.features(idx)`
is 0.0.
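
A minimal sketch of that suggestion, using plain Scala collections rather than an RDD so it runs without Spark. `Point`, `SkipZeros`, and `nonZeroPairs` are hypothetical names for illustration only; the real code operates on the `input` RDD from the diff, and whatever consumes these pairs would have to account for the omitted zeros itself.

```scala
// Hypothetical stand-in for the point type in the diff; only `features` matters here.
final case class Point(features: Array[Double])

object SkipZeros {
  // Emit (featureIndex, value) pairs, skipping zero values when the pair
  // would be created instead of filtering them out in a later step.
  def nonZeroPairs(points: Seq[Point], continuousFeatures: Seq[Int]): Seq[(Int, Double)] =
    points.flatMap { point =>
      continuousFeatures.flatMap { idx =>
        val value = point.features(idx)
        if (value != 0.0) Some((idx, value)) else None
      }
    }
}
```

With the RDD in the diff, the same inner `flatMap` would replace the `continuousFeatures.map(...)` call, so zero-valued records never enter the shuffle.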