GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/2341
[SPARK-3160] [mllib] DecisionTree: eliminate pre-allocated nodes,
parentImpurities arrays
This PR includes some code simplifications and re-organization which will
be helpful for implementing random forests. The main changes are that the
nodes and parentImpurities arrays are no longer pre-allocated in the main
train() method.
Relation to RFs:
* Since trees in RFs will be deeper and therefore more likely to be sparse
(not full trees), avoiding pre-allocation of a full tree could save cost.
* The associated re-organization also reduces bookkeeping, which will make
RFs easier to implement.
* The return code doneTraining may be generalized to include cases such as
nodes ready for local training.
Details:
No longer pre-allocate parentImpurities array in main train() method.
* parentImpurities values are now stored in individual nodes (in
Node.stats.impurity).
* These were not really needed. They were used in calculateGainForSplit(),
but they can be calculated anyway from parentNodeAgg.
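
To illustrate why a separate parentImpurities array is redundant, here is a
minimal sketch, not the code in this PR. It assumes a classification task with
Gini impurity, and it assumes the aggregate statistics reduce to per-class
label counts for the left and right children; the actual layout of
parentNodeAgg is not shown in this message. ImpuritySketch, gini and gain are
hypothetical names.

```scala
// Hypothetical sketch: names and data layout are assumptions, not MLlib's API.
object ImpuritySketch {
  // Gini impurity from per-class label counts: 1 - sum(p_k^2).
  def gini(counts: Array[Double]): Double = {
    val total = counts.sum
    if (total == 0.0) 0.0
    else 1.0 - counts.map { c => val p = c / total; p * p }.sum
  }

  // Gain of a split. The parent's counts are the element-wise sum of the
  // children's counts, so the parent impurity is derived here rather than
  // looked up in a pre-allocated array.
  def gain(left: Array[Double], right: Array[Double]): Double = {
    val parent = left.zip(right).map { case (l, r) => l + r }
    val total = parent.sum
    if (total == 0.0) 0.0
    else {
      val leftWeight = left.sum / total
      val rightWeight = right.sum / total
      gini(parent) - leftWeight * gini(left) - rightWeight * gini(right)
    }
  }
}
```

In this sketch a split that separates two balanced classes perfectly has a
gain equal to the parent impurity, and a split that leaves both children with
the parent's class mix has a gain of zero.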
No longer using Node.build since tree structure is constructed on-the-fly.
* Did not eliminate since it is public (Developer) API. Marked as
deprecated.
Eliminated pre-allocated nodes array in main train() method.
* Nodes are constructed and added to the tree structure as needed during
training.
* Moved tree construction from main train() method into
findBestSplitsPerGroup() since there is no need to keep the (split, gain) array
for an entire level of nodes. Only one element of that array is needed at a
time, so we do not need the array.
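
The difference between pre-allocating a full tree and building nodes as needed
can be sketched as follows. This is an illustration only: SketchNode and
OnTheFlySketch are hypothetical and are not MLlib's Node class, and the
heap-style node ids (root = 1, children 2*id and 2*id + 1) are an assumption of
the sketch.

```scala
// Hypothetical minimal node type; not MLlib's Node class.
final class SketchNode(val id: Int) {
  var left: Option[SketchNode] = None
  var right: Option[SketchNode] = None

  // Number of nodes actually allocated in the subtree rooted here.
  def size: Int =
    1 + left.map(_.size).getOrElse(0) + right.map(_.size).getOrElse(0)
}

object OnTheFlySketch {
  // Slots needed by a pre-allocated full binary tree with levels 0..maxDepth.
  def fullTreeSize(maxDepth: Int): Int = (1 << (maxDepth + 1)) - 1

  // Children are created only when a node is actually split.
  def split(node: SketchNode): Unit = {
    node.left = Some(new SketchNode(2 * node.id))
    node.right = Some(new SketchNode(2 * node.id + 1))
  }

  // Builds a very sparse tree: at each level only the left child splits.
  def buildLeftChain(depth: Int): SketchNode = {
    val root = new SketchNode(1)
    var current = root
    for (_ <- 0 until depth) {
      split(current)
      current = current.left.get
    }
    root
  }
}
```

For a depth of 10, the full tree needs 2047 slots while the sparse chain above
allocates 21 nodes, which is the kind of saving deep, sparse trees could see.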
findBestSplits() now returns 2 items:
* rootNode (newly created root node on first iteration, same root node on
later iterations)
* doneTraining (indicating if all nodes at that level were leaves)
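
A caller might consume the two return values roughly as below. This is a
sketch under stated assumptions: findBestSplitsStub is a hypothetical stand-in
for findBestSplits() whose real signature is not given in this message, the
root is modelled as a plain String, and lastSplittableLevel is an invented
parameter that simulates the level at which every node becomes a leaf.

```scala
// Hypothetical sketch of a level-wise training loop; not the PR's code.
object LevelLoopSketch {
  // Stand-in for findBestSplits(): creates the root on the first call, returns
  // the same root afterwards, and reports whether the level had only leaves.
  def findBestSplitsStub(
      level: Int,
      root: Option[String],
      lastSplittableLevel: Int): (String, Boolean) = {
    val rootNode = root.getOrElse("root")
    val doneTraining = level >= lastSplittableLevel
    (rootNode, doneTraining)
  }

  // Returns the root and the number of levels that were processed.
  def train(maxDepth: Int, lastSplittableLevel: Int): (String, Int) = {
    require(maxDepth >= 0, "maxDepth must be non-negative")
    var root: Option[String] = None
    var level = 0
    var done = false
    while (level <= maxDepth && !done) {
      val (rootNode, doneTraining) =
        findBestSplitsStub(level, root, lastSplittableLevel)
      root = Some(rootNode)
      done = doneTraining
      level += 1
    }
    (root.get, level)
  }
}
```

The loop stops early as soon as doneTraining is true, so levels below the last
one that split are never visited.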
Updated DecisionTreeSuite. Notes:
* Improved test "Second level node building with vs. without groups"
** generateOrderedLabeledPoints() modified so that it really does require 2
levels of internal nodes.
* Related update: Added Node.deepCopy (private[tree]), used for test suite
CC: @mengxr
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark dt-spark-3160
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2341.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2341
----
commit 2ab763b2ca1bbc8977777ab898b28965dce5a8a3
Author: Joseph K. Bradley <[email protected]>
Date: 2014-09-09T17:42:46Z
Simplifications to DecisionTree code:
No longer pre-allocate parentImpurities array in main train() method.
* parentImpurities values are now stored in individual nodes (in
Node.stats.impurity).
No longer using Node.build since tree structure is constructed on-the-fly.
* Did not eliminate since it is public (Developer) API.
Also: Updated DecisionTreeSuite test "Second level node building with vs.
without groups"
* generateOrderedLabeledPoints() modified so that it really does require 2
levels of internal nodes.
commit 1a8f0add470e4ed53100ce6cf344e24448a0ba42
Author: Joseph K. Bradley <[email protected]>
Date: 2014-09-10T02:34:55Z
Eliminated pre-allocated nodes array in main train() method.
* Nodes are constructed and added to the tree structure as needed during
training.
Moved tree construction from main train() method into
findBestSplitsPerGroup() since there is no need to keep the (split, gain) array
for an entire level of nodes. Only one element of that array is needed at a
time, so we do not need the array.
findBestSplits() now returns 2 items:
* rootNode (newly created root node on first iteration, same root node on
later iterations)
* doneTraining (indicating if all nodes at that level were leafs)
Also:
* Added Node.deepCopy (private[tree]), used for test suite
* Updated test suite (same functionality)
commit d4dbb99a50418e0168d85db457458d8d96edc242
Author: Joseph K. Bradley <[email protected]>
Date: 2014-09-10T02:35:06Z
Merge remote-tracking branch 'upstream/master' into dt-spark-3160
commit d4d786407a9bb5fce14dd7999097b21d6fa1cf5e
Author: Joseph K. Bradley <[email protected]>
Date: 2014-09-10T02:45:30Z
Marked Node.build as deprecated
commit eaa1dcf6a46501779ae58c746e672583d10ff6c8
Author: Joseph K. Bradley <[email protected]>
Date: 2014-09-10T02:58:27Z
Added topNode doc in DecisionTree and scalastyle fix
commit 306120fc93021f3d2d86333c77296fe3d36b76e1
Author: Joseph K. Bradley <[email protected]>
Date: 2014-09-10T03:09:02Z
Fixed typo in DecisionTreeModel.scala doc
----