GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/3461

    [SPARK-4580] [SPARK-4610] [mllib] Documentation for tree ensembles + DecisionTree API fix

    Major changes:
    * Added documentation for tree ensembles
    * Added examples for tree ensembles
    * **API change**: Standardized the tree parameter for the number of classes (for classification) by renaming numClassesForClassification to numClasses
    
    Minor changes:
    * Updated decision tree documentation
    * Updated existing tree and tree ensemble examples
     * Use train/test split, and compute test error instead of training error.
     * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)
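
As a rough illustration of the two example changes above (holding out a test set and deriving the number of classes from the data), here is a minimal plain-Python sketch. It is not the code from this PR and does not use the Spark API; the helper names `train_test_split`, `num_classes_from_labels`, and `held_out_error`, the 70/30 split, and the assumption that labels are the integers 0..k-1 are all illustrative.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Randomly partition rows into (train, test) lists."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row goes to the test set with probability test_fraction.
        (test if rng.random() < test_fraction else train).append(row)
    return train, test


def num_classes_from_labels(labels):
    """Assumes labels are the integers 0, 1, ..., k-1 and returns k."""
    return int(max(labels)) + 1


def held_out_error(predict, labeled_rows):
    """Fraction of (label, features) rows that predict() classifies wrongly."""
    if not labeled_rows:
        raise ValueError("cannot compute an error on an empty set")
    wrong = sum(1 for label, features in labeled_rows
                if predict(features) != label)
    return wrong / len(labeled_rows)


# Toy data: (label, features) pairs with two classes.
rows = [(i % 2, [float(i)]) for i in range(100)]
train_rows, test_rows = train_test_split(rows)

# Derive the class count from the training labels instead of hard-coding it.
num_classes = num_classes_from_labels([label for label, _ in train_rows])

# Stand-in "model" that always predicts class 0; a real example would train
# a tree or ensemble on train_rows here.
error = held_out_error(lambda features: 0, test_rows)
print("numClasses = %d, test error = %.3f" % (num_classes, error))
```

The point of the sketch is only that the error is measured on rows the model never saw during training, which gives a less optimistic estimate than training error.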
    
    Note: I know this is a lot of lines, but most of them come from:
    * Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.)
    * New examples (which were copied from the programming guide)
    * The "numClasses" renaming
    
    CC: @mengxr @manishamde @codedeft

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark ensemble-docs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3461.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3461
    
----
commit ad3e6952aabbdabb9f5354eae5e0fec30e9b07f8
Author: Joseph K. Bradley <[email protected]>
Date:   2014-11-25T18:17:20Z

    added gbt and random forest to programming guide.  still need to update their examples

commit 6372a2b66105d4e5a512f191c3b2518e275d69a2
Author: Joseph K. Bradley <[email protected]>
Date:   2014-11-25T20:32:06Z

    updated decision tree examples to use random split.  tested all of them.

commit cdfdfbca641c605172171b706c02bdba067466bb
Author: Joseph K. Bradley <[email protected]>
Date:   2014-11-25T22:58:10Z

    added examples for GBT

commit 07fc11db76c9c38e265fa23957394d521d0bd7eb
Author: Joseph K. Bradley <[email protected]>
Date:   2014-11-25T23:48:13Z

    Renamed numClassesForClassification to numClasses everywhere in trees and ensembles.
    This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
    
    Added examples to programming guide for all ensembles.

commit abe5ed7dcfa365cff79e49737658209569b406d7
Author: Joseph K. Bradley <[email protected]>
Date:   2014-11-26T00:19:03Z

    added examples for random forest in Java and Python to examples folder

commit c76c823ee52ecd08039ea63aff90a732a2e073b8
Author: Joseph K. Bradley <[email protected]>
Date:   2014-11-26T00:31:41Z

    added migration guide for mllib

commit 706d332b5e9a0ed85518d5ad62e57a9908169028
Author: Joseph K. Bradley <[email protected]>
Date:   2014-11-26T00:56:39Z

    updated python DT runner to print full model if it is small

----


