[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

mgaido91 Thu, 01 Feb 2018 04:34:28 -0800

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r165341639
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -917,11 +916,15 @@ private[spark] object RandomForest extends Logging {
           // being spun up that will definitely do no work.
           val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
     
    +      val numInput = input.count()
    +      val bcNumInput = input.sparkContext.broadcast(numInput)
    +
           input
             .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
    --- End diff --
    
    Instead of adding the filter method there, you can avoid generating the record at all here when `point.features(idx)` is 0.0.
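
    A minimal sketch of that suggestion, using plain Scala collections rather than an RDD so that it runs without Spark. `Point`, `SkipZeroRecords` and `nonZeroPairs` are hypothetical names introduced only for illustration; in the PR the same `flatMap` body would presumably sit on `input` directly.

```scala
// Hypothetical stand-in for the point type in RandomForest.scala;
// only the `features` array matters for this illustration.
final case class Point(features: Array[Double])

object SkipZeroRecords {
  // Emit (featureIndex, value) pairs for the continuous features,
  // skipping zero values so those records are never created
  // (and, on an RDD, never shuffled).
  def nonZeroPairs(
      points: Seq[Point],
      continuousFeatures: Seq[Int]): Seq[(Int, Double)] =
    points.flatMap { point =>
      continuousFeatures.iterator
        .map(idx => (idx, point.features(idx)))
        .filter { case (_, value) => value != 0.0 }
    }
}
```

    With this approach the zero values are dropped before any downstream step sees them, so the number of zeros per feature would have to be recovered some other way, for example from the total row count (this is an assumption about how `numInput` in the diff is meant to be used).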



