GitHub user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/3461#discussion_r21113959
--- Diff: docs/mllib-gbt.md ---
@@ -0,0 +1,308 @@
+---
+layout: global
+title: Gradient-Boosted Trees - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Gradient-Boosted Trees
+---
+
+* Table of contents
+{:toc}
+
+[Gradient-Boosted Trees (GBTs)](http://en.wikipedia.org/wiki/Gradient_boosting)
+are ensembles of [decision trees](mllib-decision-tree.html).
+GBTs iteratively train decision trees in order to minimize a loss function.
+Like decision trees, GBTs handle categorical features,
+extend to the multiclass classification setting, do not require
+feature scaling, and are able to capture non-linearities and feature interactions.
+
+MLlib supports GBTs for binary classification and for regression,
+using both continuous and categorical features.
+MLlib implements GBTs using the existing [decision tree](mllib-decision-tree.html) implementation. Please see the decision tree guide for more information on trees.
+
+*Note*: GBTs do not yet support multiclass classification. For multiclass problems, please use
+[decision trees](mllib-decision-tree.html) or [Random Forests](mllib-random-forest.html).
+
+## Basic algorithm
+
+Gradient boosting iteratively trains a sequence of decision trees.
+On each iteration, the algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label. The dataset is re-labeled to put more weight on training instances with poor predictions. Thus, in the next iteration, the decision tree will help correct for previous mistakes.
+
+The specific weight mechanism is defined by a loss function (discussed below). With each iteration, GBTs further reduce this loss function on the training data.
+
+### Comparison with Random Forests
+
+Both GBTs and [Random Forests](mllib-random-forest.html) are algorithms for learning ensembles of trees, but the training processes are different. There are several practical trade-offs:
+
+ * GBTs may be able to achieve the same accuracy using fewer trees, so the model produced may be smaller (faster for test time prediction).
+ * GBTs train one tree at a time, so they can take longer to train than random forests. Random Forests can train multiple trees in parallel.
+ * On the other hand, it is often reasonable to use smaller trees with GBTs than with Random Forests, and training smaller trees takes less time.
+ * Random Forests can be less prone to overfitting. Training more trees in a Random Forest reduces the likelihood of overfitting, but training more trees with GBTs increases the likelihood of overfitting.
+
+In short, both algorithms can be effective. GBTs may be more useful if test time prediction speed is important. Random Forests are arguably more successful in industry.
--- End diff --
I'll say less : )
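
For anyone reading the "Basic algorithm" text in the diff above, here is a minimal sketch of the loop it describes. This is not MLlib's implementation. It assumes squared-error loss (so "re-labeling" means fitting each new tree to the current residuals), a single numeric feature, and depth-1 trees. The names `fit_stump`, `predict_stump`, `train_gbt`, `predict_gbt`, and `learning_rate` are hypothetical and chosen only for illustration.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree: one threshold, one mean per side."""
    best = None
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not left or not right:
            continue  # a split must leave instances on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # No valid split (all xs equal): fall back to a constant prediction.
        mean = sum(targets) / len(targets)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def train_gbt(xs, ys, num_iterations=20, learning_rate=0.5):
    """Train stumps one at a time; return the ensemble and per-iteration loss."""
    ensemble = []
    predictions = [0.0] * len(xs)
    losses = []
    for _ in range(num_iterations):
        # "Re-label" the data: for squared error, the new labels are the
        # residuals, so poorly predicted instances get the largest targets.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        ensemble.append(stump)
        # Add the new tree's (shrunken) output to the ensemble's prediction.
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
        # Mean squared error on the training data after this iteration.
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return ensemble, losses


def predict_gbt(ensemble, x, learning_rate=0.5):
    """Sum the shrunken outputs of every tree in the ensemble."""
    return sum(learning_rate * predict_stump(stump, x) for stump in ensemble)


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5, 6]
    ys = [1.0, 1.2, 3.1, 2.9, 5.0, 5.2]
    ensemble, losses = train_gbt(xs, ys)
    print("first/last training loss:", losses[0], losses[-1])
```

In this sketch the training loss does not increase from one iteration to the next, which matches the statement in the diff that each iteration further reduces the loss on the training data. Note that `predict_gbt` must be called with the same `learning_rate` used in training.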