[GitHub] spark pull request: [SPARK-4580] [SPARK-4610] [mllib] Documentatio...

manishamde Sun, 30 Nov 2014 19:25:52 -0800

Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3461#discussion_r21068173
  
    --- Diff: docs/mllib-gbt.md ---
    @@ -0,0 +1,308 @@
    +---
    +layout: global
    +title: Gradient-Boosted Trees - MLlib
    +displayTitle: <a href="mllib-guide.html">MLlib</a> - Gradient-Boosted Trees
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +[Gradient-Boosted Trees (GBTs)](http://en.wikipedia.org/wiki/Gradient_boosting)
    +are ensembles of [decision trees](mllib-decision-tree.html).
    +GBTs iteratively train decision trees in order to minimize a loss function.
    +Like decision trees, GBTs handle categorical features,
    +extend to the multiclass classification setting, do not require
    +feature scaling, and are able to capture non-linearities and feature interactions.
    +
    +MLlib supports GBTs for binary classification and for regression,
    +using both continuous and categorical features.
    +MLlib implements GBTs using the existing [decision tree](mllib-decision-tree.html) implementation.  Please see the decision tree guide for more information on trees.
    +
    +*Note*: GBTs do not yet support multiclass classification.  For multiclass problems, please use
    +[decision trees](mllib-decision-tree.html) or [Random Forests](mllib-random-forest.html).
    +
    +## Basic algorithm
    +
    +Gradient boosting iteratively trains a sequence of decision trees.
    +On each iteration, the algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label.  The dataset is re-labeled to put more weight on training instances with poor predictions.  Thus, in the next iteration, the decision tree will help correct for previous mistakes.
    +
    +The specific weight mechanism is defined by a loss function (discussed below).  With each iteration, GBTs further reduce this loss function on the training data.
    +
    +### Comparison with Random Forests
    --- End diff --
    
    I really like this section; it is very useful information. We should
    try to add some graphs here in a separate PR. However, shouldn't this go in
    a separate section under Ensemble that compares both RF and Boosting
    algorithms in terms of performance and accuracy?
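
    As an aside on the "Basic algorithm" paragraph quoted in the diff, the
    re-labeling loop it describes can be sketched roughly as below. This is
    not MLlib's implementation: it assumes squared-error loss (so the
    "re-labeled" targets are simply the residuals), a single numeric feature,
    and depth-1 trees. The names fit_stump, fit_gbt, num_iterations and
    learning_rate are illustrative choices, not MLlib API.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree (one threshold, two leaf means) by least squares."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All feature values are identical: fall back to a constant prediction.
        mean = sum(targets) / len(targets)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def squared_error(labels, preds):
    """Mean squared error, the loss assumed throughout this sketch."""
    return sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels)


def fit_gbt(xs, labels, num_iterations=10, learning_rate=0.5):
    """Train a sequence of stumps, each fitted to the current ensemble's residuals."""
    base = sum(labels) / len(labels)      # initial prediction: the label mean
    trees = []
    preds = [base] * len(xs)
    losses = [squared_error(labels, preds)]
    for _ in range(num_iterations):
        # Re-label: instances with poor predictions get large-magnitude targets.
        residuals = [y - p for y, p in zip(labels, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
        losses.append(squared_error(labels, preds))

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    return predict, losses


# Toy data (made up for illustration): a step between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
predict, losses = fit_gbt(xs, labels)
```

    Under these assumptions the recorded training loss never increases from
    one iteration to the next, which matches the diff's statement that each
    iteration further reduces the loss on the training data.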


