[GitHub] spark pull request: [SPARK-3366][MLLIB]Compute best splits distrib...

jkbradley Wed, 01 Oct 2014 11:14:53 -0700

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2595#discussion_r18297115
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -518,30 +512,69 @@ object DecisionTree extends Serializable with Logging {
           agg
         }
     
    -    // Calculate bin aggregates.
    -    timer.start("aggregation")
    -    val binAggregates: DTStatsAggregator = {
    -      val initAgg = if (metadata.subsamplingFeatures) {
    -        new DTStatsAggregatorSubsampledFeatures(metadata, treeToNodeToIndexInfo)
    -      } else {
    -        new DTStatsAggregatorFixedFeatures(metadata, numNodes)
    +    /**
    +     * Get node index in group --> features indices map,
    +     * which is a short cut to find feature indices for a node given node index in group
    +     * @param treeToNodeToIndexInfo
    +     * @return
    +     */
    +    def getNodeToFeatures(treeToNodeToIndexInfo: Map[Int, Map[Int, NodeIndexInfo]])
    --- End diff --
    
    I like this more since it confines the reshaped data to the one place
    where the data are used.  Since the map now holds only feature subsets
    (not node indices), it might be worth using Option[Map[Int, Array[Int]]]
    instead of Map[Int, Option[Array[Int]]].  (That saves space for
    DecisionTree, which typically does not subsample features: it would hold
    a single None rather than a map full of None values.)
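
    A rough sketch of the suggested return type, to make the difference
    concrete. This is not code from the PR: `NodeIndexInfo` below is a
    simplified stand-in with assumed field names, the `subsamplingFeatures`
    parameter is assumed, and `NodeToFeaturesSketch` is a hypothetical
    wrapper object.

```scala
// Simplified, hypothetical stand-in for the NodeIndexInfo referenced in the diff.
case class NodeIndexInfo(nodeIndexInGroup: Int, featureSubset: Option[Array[Int]])

object NodeToFeaturesSketch {
  // Returns None when features are not subsampled, so no per-node map is built.
  // Otherwise returns a map: node index in group -> feature indices for that node.
  def getNodeToFeatures(
      treeToNodeToIndexInfo: Map[Int, Map[Int, NodeIndexInfo]],
      subsamplingFeatures: Boolean): Option[Map[Int, Array[Int]]] = {
    if (!subsamplingFeatures) {
      None
    } else {
      val pairs = treeToNodeToIndexInfo.values.flatMap { nodeToIndexInfo =>
        nodeToIndexInfo.values.flatMap { info =>
          // Nodes without a feature subset are skipped.
          info.featureSubset.map(features => info.nodeIndexInGroup -> features)
        }
      }
      Some(pairs.toMap)
    }
  }
}
```

    With this shape, callers test the outer Option once instead of
    unwrapping an Option for every node.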


