Github user codedeft commented on the pull request:
https://github.com/apache/spark/pull/2435#issuecomment-55967377
Hi Joseph,
I'll take a look when I can, but this is a massive PR, so I'm not sure if
I'll have time to go through this thoroughly.
* I suppose that from Alpine's point of view, we should then wait until
this is merged before contributing local training and others?
* Does this also include node caching (instead of passing trees to
executors) ?
Thanks!
On Wed, Sep 17, 2014 at 12:45 PM, jkbradley <[email protected]>
wrote:
> This PR adds RandomForest to MLlib. The implementation is basic, and
> future performance optimizations will be important. (Note: RFs = Random
> Forests.)
> Overview RandomForest
>
> - trains multiple trees at once to reduce the number of passes over
> the data
> - allows feature subsets at each node
> - uses a queue of nodes instead of fixed groups for each level
>
> This implementation is based an implementation by @manishamde
> <https://github.com/manishamde> and the Alpine Labs Sequoia Forest
> <https://github.com/AlpineNow/SparkML2> by @codedeft
> <https://github.com/codedeft> (in particular, the TreePoint, BaggedPoint,
> and node queue implementations). Thank you for your inputs!
> Testing
>
> This has been tested for correctness with the test suites and with
> DecisionTreeRunner on example datasets.
> This has been performance tested using this branch of spark-perf
> <https://github.com/jkbradley/spark-perf/tree/rfs>. For training 1 tree,
> there are small regressions, especially from feature subsampling.
>
> Detailed results are below. These were run on an EC2 cluster with 15
> workers, training 1 tree with maxDepth = 5 (= 6 levels). The 2 result
> columns marked with (numTrees) are results after implementing RFs to train
> multiple trees at once, using a node queue. The 2 columns marked with
> (features subsets) are follow-up results after adding feature subsampling.
> Speedup values < 1 indicate slowdowns from the old DecisionTree
> implementation.
> numInstances numFeatures runtime (sec) speedup runtime (sec) speedup
> (numTrees) (numTrees) (feature subsets) (feature subsets) 20000 100 4.051
> 1.044433473 4.478 0.9448414471 20000 500 8.472 1.104461756 9.315
> 1.004508857 20000 1500 19.354 1.05854087 20.863 0.9819776638 20000 3500
> 43.674 1.072033704 45.887 1.020332556 200000 100 4.196 1.171830315 4.848
> 1.014232673 200000 500 8.926 1.082791844 9.771 0.989151571 200000 1500
> 20.58 1.068415938 22.134 0.9934038131 200000 3500 48.043 1.075203464
> 52.249 0.9886505005 2000000 100 4.944 1.01355178 5.796 0.8645617667
> 2000000 500 11.11 1.016831683 12.482 0.9050632911 2000000 1500 31.144
> 1.017852556 35.274 0.8986789136 2000000 3500 79.981 1.085382778 101.105
> 0.8586123337 20000000 100 8.304 0.9270231214 9.073 0.8484514494 20000000
> 500 28.174 1.083268262 34.236 0.8914592826 20000000 1500 143.97
> 0.9579634646 159.275 0.8659111599 Details on specific classes Changes to
> DecisionTree
>
> - Main train() method is now in RandomForest.
> - findBestSplits() is no longer needed. (It split levels into groups,
> but we now use a queue of nodes.)
> - Many small changes to support RFs. (Note: These methods should be
> moved to RandomForest.scala in a later PR, but are in
DecisionTree.scala to
> make code comparison easier.)
>
> RandomForest
>
> - Main train() method is from old DecisionTree.
> - selectNodesToSplit: Note that it selects nodes and feature subsets
> jointly to track memory usage.
>
> RandomForestModel
>
> - Stores an Array[DecisionTreeModel]
> - Prediction:
> - For classification, most common label. For regression, mean.
> - We could support other methods later.
>
> examples/.../DecisionTreeRunner
>
> - This now takes numTrees and featureSubsetStrategy, to support RFs.
>
> DTStatsAggregator
>
> - 2 types of functionality (w/ and w/o subsampling features): These
> require different indexing methods. (We could treat both as
subsampling,
> but this is less efficient DTStatsAggregator is now abstract, and 2
child
> classes implement these 2 types of functionality.
>
> impurities
>
> - These now take instance weights.
>
> Node
>
> - Some vals changed to vars.
> - This is unfortunately a public API change (DeveloperApi). This
> could be avoided by creating a LearningNode struct, but would be
awkward.
>
> RandomForestSuite
>
> Please let me know if there are missing tests!
> BaggedPoint
>
> This wraps TreePoint and holds bootstrap weights/counts.
> Design decisions
>
> -
>
> BaggedPoint: BaggedPoint is separate from TreePoint since it may be
> useful for other bagging algorithms later on.
> -
>
> RandomForest public API: What options should be easily supported by
> the train* methods? Should ALL options be in the Java-friendly
> constructors? Should there be a constructor taking Strategy?
> -
>
> Feature subsampling options: What options should be supported?
> scikit-learn supports the same options, except for "onethird." One
option
> would be to allow users to specific fractions ("0.1"): the current
options
> could be supported, and any unrecognized values would be parsed as
Doubles
> in [0,1].
> -
>
> Splits and bins are computed before bootstrapping, so all trees use
> the same discretization.
> -
>
> One queue, instead of one queue per tree.
>
> CC: @mengxr <https://github.com/mengxr> @manishamde
> <https://github.com/manishamde> @codedeft <https://github.com/codedeft>
> @chouqin <https://github.com/chouqin> Please let me know if you have
> suggestions---thanks!
> ------------------------------
> You can merge this Pull Request by running
>
> git pull https://github.com/jkbradley/spark rfs-new
>
> Or view, comment on, or merge it at:
>
> https://github.com/apache/spark/pull/2435
> Commit Summary
>
> - add min info gain and min instances per node parameters in decision
> tree
> - separate calculation of predict of node from calculation of info gain
> - fix bug
> - fix style
> - fix style
> - add comments
> - fix bug
> - add api docs
> - Merge branch 'master' of https://github.com/apache/spark into
> dt-preprune
> - minor fix: remove empty lines
> - fix style
> - fix bug
> - Simplifications to DecisionTree code:
> - fix docs and change minInstancesPerNode to 1
> - remove `noSplit` and set `Predict` private to tree
> - Eliminated pre-allocated nodes array in main train() method.
> - Merge remote-tracking branch 'upstream/master' into dt-spark-3160
> - Marked Node.build as deprecated
> - Added topNode doc in DecisionTree and scalastyle fix
> - Fixed typo in DecisionTreeModel.scala doc
> - Added minInstancesPerNode and minInfoGain parameters to
> DecisionTreeRunner.scala and to Python API in tree.py
> - change edge `minInstancesPerNode` to 2 and add one more test
> - fix typo
> - fix typo
> - Merge remote-tracking branch 'chouqin/dt-preprune' into
> chouqin-dt-preprune
> - Merge remote-tracking branch 'upstream/master' into
> chouqin-dt-preprune
> - * Fixed typo in tree suite test "do not choose split that does not
> satisfy min instance per node requirements"
> - Merge remote-tracking branch 'upstream/master' into dt-spark-3160
> - Added check in Strategy to make sure minInstancesPerNode >= 1
> - Merge remote-tracking branch 'upstream/master' into rfs-new
> - RFs partly implemented, not done yet
> - Merge remote-tracking branch 'upstream/master' into
> chouqin-dt-preprune
> - Added max of 10GB for maxMemoryInMB in Strategy.
> - Merge remote-tracking branch 'upstream/master' into rfs-new
> - Merge branch 'chouqin-dt-preprune' into rfs-new
> - Basic random forests are implemented. Random features per node not
> yet implemented. Test suite not implemented.
> - Updated docs. Small fix for bug which does not cause errors: No
> longer allocate unused child nodes for leaf nodes.
> - Merge remote-tracking branch 'upstream/master' into rfs-new. Added
> RandomForestModel.toString
> - Implemented feature subsampling. Tested DecisionTree but not
> RandomForest.
> - Added numTrees and featureSubsetStrategy to DecisionTreeRunner (to
> support RandomForest). Fixed bugs so that RandomForest now runs.
> - Added RandomForestSuite, and fixed small bugs, style issues.
> - Merge remote-tracking branch 'upstream/master' into rfs-new
> - removed usage of null from RandomForest and replaced with Option
>
> File Changes
>
> - *M*
>
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
> <https://github.com/apache/spark/pull/2435/files#diff-0> (82)
> - *M*
> mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
> <https://github.com/apache/spark/pull/2435/files#diff-1> (477)
> - *A*
> mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala
> <https://github.com/apache/spark/pull/2435/files#diff-2> (430)
> - *A*
> mllib/src/main/scala/org/apache/spark/mllib/tree/impl/BaggedPoint.scala
> <https://github.com/apache/spark/pull/2435/files#diff-3> (80)
> - *M*
>
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala
> <https://github.com/apache/spark/pull/2435/files#diff-4> (311)
> - *M*
>
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
> <https://github.com/apache/spark/pull/2435/files#diff-5> (38)
> - *M*
> mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Entropy.scala
> <https://github.com/apache/spark/pull/2435/files#diff-6> (4)
> - *M*
> mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala
> <https://github.com/apache/spark/pull/2435/files#diff-7> (4)
> - *M*
>
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala
> <https://github.com/apache/spark/pull/2435/files#diff-8> (2)
> - *M*
>
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Variance.scala
> <https://github.com/apache/spark/pull/2435/files#diff-9> (8)
> - *M* mllib/src/main/scala/org/apache/spark/mllib/tree/model/Node.scala
> <https://github.com/apache/spark/pull/2435/files#diff-10> (13)
> - *A*
>
mllib/src/main/scala/org/apache/spark/mllib/tree/model/RandomForestModel.scala
> <https://github.com/apache/spark/pull/2435/files#diff-11> (106)
> - *M*
>
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
> <https://github.com/apache/spark/pull/2435/files#diff-12> (203)
> - *A*
>
mllib/src/test/scala/org/apache/spark/mllib/tree/RandomForestSuite.scala
> <https://github.com/apache/spark/pull/2435/files#diff-13> (219)
>
> Patch Links:
>
> - https://github.com/apache/spark/pull/2435.patch
> - https://github.com/apache/spark/pull/2435.diff
>
> â
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/spark/pull/2435>.
>
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]