Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/2595#discussion_r18297115
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -518,30 +512,69 @@ object DecisionTree extends Serializable with Logging {
agg
}
- // Calculate bin aggregates.
- timer.start("aggregation")
- val binAggregates: DTStatsAggregator = {
- val initAgg = if (metadata.subsamplingFeatures) {
- new DTStatsAggregatorSubsampledFeatures(metadata, treeToNodeToIndexInfo)
- } else {
- new DTStatsAggregatorFixedFeatures(metadata, numNodes)
+ /**
+ * Get node index in group --> features indices map,
+ * which is a short cut to find feature indices for a node given node index in group
+ * @param treeToNodeToIndexInfo
+ * @return
+ */
+ def getNodeToFeatures(treeToNodeToIndexInfo: Map[Int, Map[Int, NodeIndexInfo]])
--- End diff --
I like this more, since it confines the reshaped data to the one place where
the data are used. Since the map now holds only feature subsets (not node
indices), it might be worth using Option[Map[Int, Array[Int]]] instead of
Map[Int, Option[Array[Int]]]. (That would save space for DecisionTree.)
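
A rough sketch of the shape suggested above, not the actual patch. The NodeIndexInfo case class here is a simplified, hypothetical stand-in for the one in the diff (its fields are assumed), and the sketch assumes that either every node carries a feature subset or none does:

```scala
// Hypothetical sketch only; names other than getNodeToFeatures and
// NodeIndexInfo do not come from the pull request.
object NodeToFeaturesSketch {
  // Assumed fields: the node's index within its group, and an optional
  // subset of feature indices (None when features are not subsampled).
  final case class NodeIndexInfo(nodeIndexInGroup: Int, featureSubset: Option[Array[Int]])

  // Option on the outside: a run without feature subsampling yields a single
  // None instead of a map holding one None entry per node.
  def getNodeToFeatures(
      treeToNodeToIndexInfo: Map[Int, Map[Int, NodeIndexInfo]]): Option[Map[Int, Array[Int]]] = {
    val infos = treeToNodeToIndexInfo.values.flatMap(_.values)
    if (infos.exists(_.featureSubset.isEmpty)) {
      None
    } else {
      // Key by node index in group, as described in the doc comment in the diff.
      Some(infos.map(info => info.nodeIndexInGroup -> info.featureSubset.get).toMap)
    }
  }
}
```

With this shape, callers check the outer Option once per group rather than unwrapping an Option for every node lookup.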