[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

jkbradley Tue, 09 Sep 2014 21:08:09 -0700

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/2341


    [SPARK-3160] [mllib]  DecisionTree: eliminate pre-allocated nodes, 
parentImpurities arrays

    This PR includes some code simplifications and re-organization which will 
be helpful for implementing random forests.  The main changes are that the 
nodes and parentImpurities arrays are no longer pre-allocated in the main 
train() method.
    
    Relation to RFs:
    * Since the trees in RFs will be deeper and therefore more likely to be 
sparse (not full trees), avoiding pre-allocation of a full tree could save cost.
    * The associated re-organization also reduces bookkeeping, which will make 
RFs easier to implement.
    * The return code doneTraining may be generalized to include cases such as 
nodes ready for local training.
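
    To make the sparsity argument concrete, here is a rough sketch comparing 
the slots a pre-allocated full binary tree needs with the nodes a very sparse 
tree actually uses.  The names fullTreeSlots and chainTreeNodes are 
hypothetical helpers for illustration, not part of MLlib, and the "chain" 
shape is just one assumed example of a sparse tree.

```scala
object PreallocationSketch {
  // Slots needed to pre-allocate a full binary tree with levels 0..maxDepth:
  // 2^(maxDepth + 1) - 1.
  def fullTreeSlots(maxDepth: Int): Long = (1L << (maxDepth + 1)) - 1

  // Nodes in a hypothetical sparse "chain" tree where, at each level, only one
  // node is split further and its sibling is a leaf: 1 root + 2 per level.
  def chainTreeNodes(maxDepth: Int): Long = 2L * maxDepth + 1
}
```

    For example, at depth 10 the full tree needs 2047 slots, while the chain 
tree in this sketch has only 21 nodes.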
    
    Details:
    
    No longer pre-allocate parentImpurities array in main train() method.
    * parentImpurities values are now stored in individual nodes (in 
Node.stats.impurity).
    * These were not really needed.  They were used in calculateGainForSplit(), 
but they can be calculated anyway using parentNodeAgg.
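
    As an illustration of why a separately stored parent impurity is 
redundant, the sketch below recomputes it from aggregate statistics.  It 
assumes classification with Gini impurity and assumes the aggregate is a plain 
array of per-class label counts; the actual layout of parentNodeAgg is not 
shown here, and gini and parentCounts are hypothetical helper names.

```scala
object ImpuritySketch {
  // Gini impurity from per-class label counts: 1 - sum_k (count_k / total)^2.
  // An empty node is treated as pure (impurity 0).
  def gini(counts: Array[Double]): Double = {
    val total = counts.sum
    if (total == 0.0) 0.0
    else 1.0 - counts.map(c => (c / total) * (c / total)).sum
  }

  // The parent's counts are the element-wise sum of its children's counts,
  // so the parent impurity can be derived whenever the aggregates are at hand.
  def parentCounts(left: Array[Double], right: Array[Double]): Array[Double] =
    left.zip(right).map { case (l, r) => l + r }
}
```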
    
    No longer using Node.build since tree structure is constructed on-the-fly.
    * Did not eliminate since it is public (Developer) API.  Marked as 
deprecated.
    
    Eliminated pre-allocated nodes array in main train() method.
    * Nodes are constructed and added to the tree structure as needed during 
training.
    * Moved tree construction from main train() method into 
findBestSplitsPerGroup() since there is no need to keep the (split, gain) array 
for an entire level of nodes.  Only one element of that array is needed at a 
time, so we do not need to store the array.
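
    A minimal sketch of on-the-fly construction, as opposed to indexing into 
a pre-allocated array, might look like the following.  SketchNode and split 
are hypothetical and are not MLlib's Node API; the 1-based heap-style child 
ids (2 * id and 2 * id + 1) are an assumption made for the example.

```scala
object OnTheFlySketch {
  // Hypothetical node type used only for this illustration.
  final class SketchNode(val id: Int) {
    var isLeaf: Boolean = true
    var left: Option[SketchNode] = None
    var right: Option[SketchNode] = None
  }

  // Children are created and attached only when a split is chosen for `node`,
  // so no storage exists for nodes that are never reached.
  def split(node: SketchNode): Unit = {
    node.isLeaf = false
    node.left = Some(new SketchNode(2 * node.id))
    node.right = Some(new SketchNode(2 * node.id + 1))
  }

  // Count the nodes that were actually constructed.
  def countNodes(node: SketchNode): Int =
    1 + node.left.map(countNodes).getOrElse(0) +
      node.right.map(countNodes).getOrElse(0)
}
```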
    
    findBestSplits() now returns 2 items:
    * rootNode (newly created root node on first iteration, same root node on 
later iterations)
    * doneTraining (indicating if all nodes at that level were leaves)
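
    The way a caller might consume this pair can be sketched as a level-wise 
loop.  Everything below is a hypothetical stand-in: findBestSplitsSketch only 
simulates the (rootNode, doneTraining) contract described above, and the 
firstAllLeafLevel parameter is an invented knob that decides when every node 
at a level is a leaf.

```scala
object TrainLoopSketch {
  // Hypothetical root type that just records how many levels were processed.
  final class SketchRoot { var levelsProcessed: Int = 0 }

  // Stand-in for findBestSplits(): creates the root on the first call, returns
  // the same root instance afterwards, and reports doneTraining once all nodes
  // at this level are leaves (simulated by a fixed level).
  def findBestSplitsSketch(
      level: Int,
      root: Option[SketchRoot],
      firstAllLeafLevel: Int): (SketchRoot, Boolean) = {
    val r = root.getOrElse(new SketchRoot)
    r.levelsProcessed += 1
    (r, level >= firstAllLeafLevel)
  }

  // Level-wise loop: stop at maxDepth or as soon as doneTraining is reported.
  def train(maxDepth: Int, firstAllLeafLevel: Int): SketchRoot = {
    require(maxDepth >= 0, "maxDepth must be non-negative")
    var root: Option[SketchRoot] = None
    var level = 0
    var done = false
    while (level <= maxDepth && !done) {
      val (newRoot, doneTraining) =
        findBestSplitsSketch(level, root, firstAllLeafLevel)
      root = Some(newRoot)
      done = doneTraining
      level += 1
    }
    root.get // safe: the loop body runs at least once when maxDepth >= 0
  }
}
```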
    
    Updated DecisionTreeSuite.  Notes:
    * Improved test "Second level node building with vs. without groups"
    ** generateOrderedLabeledPoints() modified so that it really does require 2 
levels of internal nodes.
    * Related update: Added Node.deepCopy (private[tree]), used for test suite
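
    For readers unfamiliar with the idea, a recursive deep copy of a tree can 
be sketched as below.  This is an assumed shape for illustration only: 
SketchNode and its fields are hypothetical and do not reflect the actual 
signature or fields of Node.deepCopy.

```scala
object DeepCopySketch {
  // Hypothetical mutable tree node used only for this illustration.
  final class SketchNode(
      val id: Int,
      var prediction: Double,
      var left: Option[SketchNode] = None,
      var right: Option[SketchNode] = None) {

    // Recursively copy this node and its whole subtree, so that later
    // mutation of the copy cannot affect the original (useful when a test
    // needs to compare a tree before and after further training).
    def deepCopy(): SketchNode =
      new SketchNode(id, prediction, left.map(_.deepCopy()), right.map(_.deepCopy()))
  }
}
```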
    
    CC: @mengxr

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark dt-spark-3160

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2341.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2341
    
----
commit 2ab763b2ca1bbc8977777ab898b28965dce5a8a3
Author: Joseph K. Bradley <[email protected]>
Date:   2014-09-09T17:42:46Z

    Simplifications to DecisionTree code:
    
    No longer pre-allocate parentImpurities array in main train() method.
    * parentImpurities values are now stored in individual nodes (in 
Node.stats.impurity).
    
    No longer using Node.build since tree structure is constructed on-the-fly.
    * Did not eliminate since it is public (Developer) API.
    
    Also: Updated DecisionTreeSuite test "Second level node building with vs. 
without groups"
    * generateOrderedLabeledPoints() modified so that it really does require 2 
levels of internal nodes.

commit 1a8f0add470e4ed53100ce6cf344e24448a0ba42
Author: Joseph K. Bradley <[email protected]>
Date:   2014-09-10T02:34:55Z

    Eliminated pre-allocated nodes array in main train() method.
    * Nodes are constructed and added to the tree structure as needed during 
training.
    
    Moved tree construction from main train() method into 
findBestSplitsPerGroup() since there is no need to keep the (split, gain) array 
for an entire level of nodes.  Only one element of that array is needed at a 
time, so we do not need to store the array.
    
    findBestSplits() now returns 2 items:
    * rootNode (newly created root node on first iteration, same root node on 
later iterations)
    * doneTraining (indicating if all nodes at that level were leaves)
    
    Also:
    * Added Node.deepCopy (private[tree]), used for test suite
    * Updated test suite (same functionality)

commit d4dbb99a50418e0168d85db457458d8d96edc242
Author: Joseph K. Bradley <[email protected]>
Date:   2014-09-10T02:35:06Z

    Merge remote-tracking branch 'upstream/master' into dt-spark-3160

commit d4d786407a9bb5fce14dd7999097b21d6fa1cf5e
Author: Joseph K. Bradley <[email protected]>
Date:   2014-09-10T02:45:30Z

    Marked Node.build as deprecated

commit eaa1dcf6a46501779ae58c746e672583d10ff6c8
Author: Joseph K. Bradley <[email protected]>
Date:   2014-09-10T02:58:27Z

    Added topNode doc in DecisionTree and scalastyle fix

commit 306120fc93021f3d2d86333c77296fe3d36b76e1
Author: Joseph K. Bradley <[email protected]>
Date:   2014-09-10T03:09:02Z

    Fixed typo in DecisionTreeModel.scala doc

----


