GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/3461
[SPARK-4580] [SPARK-4610] [mllib] Documentation for tree ensembles +
DecisionTree API fix
Major changes:
* Added documentation for tree ensembles
* Added examples for tree ensembles
* **API change**: Standardized the tree parameter for the number of classes
(for classification)
Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
  * Use train/test split, and compute test error instead of training error.
* Fixed decision_tree_runner.py to actually use the number of classes it
computes from data. (small bug fix)
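The move from training error to test error in the updated examples can be illustrated with a small, Spark-free sketch. Everything in it is an assumption made for illustration: the toy data, the 70/30 ratio, and the `majority_class` stand-in for a trained tree are not taken from this PR, where the split is applied to the example's input data and the model is an MLlib tree or ensemble.

```python
import random
from collections import Counter


def random_split(rows, train_fraction, seed):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def majority_class(train):
    """Hypothetical stand-in for a trained model.

    Always predicts the most common label seen in the training rows.
    """
    most_common_label = Counter(label for label, _ in train).most_common(1)[0][0]
    return lambda features: most_common_label


def error_rate(model, rows):
    """Fraction of rows whose predicted label differs from the true label."""
    wrong = sum(1 for label, features in rows if model(features) != label)
    return wrong / float(len(rows))


# Toy labeled points as (label, features); made up for this sketch.
data = [(i % 2, [float(i)]) for i in range(100)]

train, test = random_split(data, train_fraction=0.7, seed=42)
model = majority_class(train)

# Training error is measured on rows the model has already seen;
# test error is measured on held-out rows and is the number to report.
print("Training error: %.3f" % error_rate(model, train))
print("Test error:     %.3f" % error_rate(model, test))
```

Reporting the held-out figure avoids the optimistic bias of evaluating a tree on the same rows it was fit to.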
Note: I know this is a lot of lines, but most of them are covered by:
* Programming guide sections for gradient boosting and random forests.
(The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming
CC: @mengxr @manishamde @codedeft
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark ensemble-docs
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3461.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3461
----
commit ad3e6952aabbdabb9f5354eae5e0fec30e9b07f8
Author: Joseph K. Bradley <[email protected]>
Date: 2014-11-25T18:17:20Z
added gbt and random forest to programming guide. still need to update
their examples
commit 6372a2b66105d4e5a512f191c3b2518e275d69a2
Author: Joseph K. Bradley <[email protected]>
Date: 2014-11-25T20:32:06Z
updated decision tree examples to use random split. tested all of them.
commit cdfdfbca641c605172171b706c02bdba067466bb
Author: Joseph K. Bradley <[email protected]>
Date: 2014-11-25T22:58:10Z
added examples for GBT
commit 07fc11db76c9c38e265fa23957394d521d0bd7eb
Author: Joseph K. Bradley <[email protected]>
Date: 2014-11-25T23:48:13Z
Renamed numClassesForClassification to numClasses everywhere in trees and
ensembles.
This is a breaking API change, but it was necessary to correct an API
inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala
used numClassesForClassification).
Added examples to programming guide for all ensembles.
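For callers affected by the rename, the change amounts to swapping one parameter name. A minimal sketch, assuming call sites collect tree parameters in a keyword dictionary: the helper `migrate_tree_kwargs` is hypothetical and is not part of Spark or of this PR. Only the two parameter names come from the commit message above.

```python
OLD_NAME = "numClassesForClassification"  # name used by the Scala API in Spark 1.1
NEW_NAME = "numClasses"                   # standardized name after this change


def migrate_tree_kwargs(kwargs):
    """Return a copy of kwargs with the old parameter name replaced by the new one.

    Hypothetical helper for illustration; the input dictionary is not modified.
    """
    migrated = dict(kwargs)
    if OLD_NAME in migrated:
        if NEW_NAME in migrated:
            # Refuse to guess which value the caller intended.
            raise ValueError("both %s and %s were given" % (OLD_NAME, NEW_NAME))
        migrated[NEW_NAME] = migrated.pop(OLD_NAME)
    return migrated


# Example: parameters written against the old name.
old_style = {"numClassesForClassification": 3, "maxDepth": 5}
print(migrate_tree_kwargs(old_style))
```

Raising on a conflict rather than silently preferring one name keeps a half-migrated call site from training with an unintended class count.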
commit abe5ed7dcfa365cff79e49737658209569b406d7
Author: Joseph K. Bradley <[email protected]>
Date: 2014-11-26T00:19:03Z
added examples for random forest in Java and Python to examples folder
commit c76c823ee52ecd08039ea63aff90a732a2e073b8
Author: Joseph K. Bradley <[email protected]>
Date: 2014-11-26T00:31:41Z
added migration guide for mllib
commit 706d332b5e9a0ed85518d5ad62e57a9908169028
Author: Joseph K. Bradley <[email protected]>
Date: 2014-11-26T00:56:39Z
updated python DT runner to print full model if it is small
----