[ https://issues.apache.org/jira/browse/SPARK-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng resolved SPARK-3161.
----------------------------------
Resolution: Fixed
Fix Version/s: 1.2.0
Issue resolved by pull request 2868
[https://github.com/apache/spark/pull/2868]
> Cache example-node map for DecisionTree training
> ------------------------------------------------
>
> Key: SPARK-3161
> URL: https://issues.apache.org/jira/browse/SPARK-3161
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Joseph K. Bradley
> Assignee: Sung Chung
> Fix For: 1.2.0
>
>
> Improvement: worker computation
> When training each level of a DecisionTree, each example must be mapped to a
> node in the current level (or to none, if it does not reach that level). This
> is currently done via predictNodeIndex(), which walks from the tree's root
> node down to the given level, costing O(level) per example (see the sketch
> below).
> Proposal: Cache this example-to-node mapping.
> * Pro: O(1) lookup instead of O(level).
> * Con: An extra RDD, which must share the same partitioning as the training data.
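> A minimal sketch of the current lookup, with simplified stand-ins for
> MLlib's internal Node class and split test (the names and signatures here
> are illustrative, not the actual MLlib API):
> {code:scala}
> // Simplified stand-in for MLlib's internal tree node.
> class Node(
>     val id: Int,
>     val leftChild: Option[Node],
>     val rightChild: Option[Node],
>     val goLeft: Array[Int] => Boolean) // evaluates this node's split
>
> // O(level) per example: walk from the root down to the deepest
> // constructed node, one split test per level.
> def predictNodeIndex(root: Node, binnedFeatures: Array[Int]): Int = {
>   var node = root
>   while (node.leftChild.isDefined && node.rightChild.isDefined) {
>     node = if (node.goLeft(binnedFeatures)) node.leftChild.get
>            else node.rightChild.get
>   }
>   node.id
> }
> {code}
> Repeating this walk for every example at every level is the cost the cached
> mapping would remove.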
> Design:
> * (option 1) Store each instance together with an array of node indices (one
> node per tree), as in [Sequoia Forests|https://github.com/AlpineNow/SparkML2]
> (see the first sketch after this list).
> * (option 2) Store an RDD[Array[Map[Int, Array[TreePoint]]]], where each
> partition holds an array of maps (one per tree) from node index to the
> instances at that node. This adds data-structure overhead but could be more
> efficient: not all nodes are split on each iteration, so each executor could
> ignore instances that are not used for the current node set (see the second
> sketch after this list).
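> A sketch of option 1, assuming hypothetical names (TreePoint mirrors MLlib's
> binned representation; InstanceWithNodes, advanceOneLevel, and childIndex are
> illustrative, not actual MLlib code):
> {code:scala}
> import org.apache.spark.rdd.RDD
>
> case class TreePoint(label: Double, binnedFeatures: Array[Int])
>
> // nodeIndices(t) = index of the node this instance occupies in tree t,
> // or -1 once it no longer reaches the current level.
> case class InstanceWithNodes(point: TreePoint, nodeIndices: Array[Int])
>
> // childIndex(tree, node, point) stands in for evaluating the new split
> // at `node` and returning the child the instance falls into.
> def advanceOneLevel(
>     data: RDD[InstanceWithNodes],
>     childIndex: (Int, Int, TreePoint) => Int): RDD[InstanceWithNodes] = {
>   val next = data.map { inst =>
>     val updated = Array.tabulate(inst.nodeIndices.length) { t =>
>       val node = inst.nodeIndices(t)
>       if (node < 0) node else childIndex(t, node, inst.point)
>     }
>     inst.copy(nodeIndices = updated)
>   }
>   next.cache() // O(1) example-to-node lookup at the next level
> }
> {code}
> Here the node indices travel with the instances themselves, so no separate
> co-partitioned RDD is needed; the trade-off is rewriting the whole RDD once
> per level.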
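> And a sketch of option 2 under the same assumptions (it reuses TreePoint
> from the previous sketch; NodeGrouping and its members are illustrative):
> each partition keeps, per tree, a map from active node index to the
> instances at that node, so executors can skip nodes not split this
> iteration.
> {code:scala}
> import org.apache.spark.rdd.RDD
>
> object NodeGrouping {
>   // Per tree: active node index -> instances currently at that node.
>   type NodeMap = Map[Int, Array[TreePoint]]
>
>   // nodesToSplit(t) = node indices chosen for splitting in tree t this
>   // iteration; childIndex is the same stand-in as in the option 1 sketch.
>   def regroup(
>       perPartition: RDD[Array[NodeMap]],
>       nodesToSplit: Array[Set[Int]],
>       childIndex: (Int, Int, TreePoint) => Int): RDD[Array[NodeMap]] = {
>     perPartition.map { trees =>
>       trees.zipWithIndex.map { case (nodeMap, t) =>
>         nodeMap.toSeq.flatMap { case (node, points) =>
>           if (nodesToSplit(t).contains(node)) {
>             // Route each instance at a split node to its child.
>             points.toSeq.map(p => childIndex(t, node, p) -> p)
>           } else {
>             Seq.empty[(Int, TreePoint)] // node not split: skip it cheaply
>           }
>         }.groupBy(_._1)
>          .map { case (child, ps) => child -> ps.map(_._2).toArray }
>       }
>     }
>   }
> }
> {code}
> Instances at nodes that are filtered out simply drop out of the maps, which
> is the efficiency win this option describes.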