GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/5009
[SPARK-6113] [mllib] [WIP] Stabilize DecisionTree and ensembles APIs
This is a **WIP** PR for cleaning up and finalizing the DecisionTree and
tree ensembles APIs.
*Please discuss overall design on the JIRA and implementation details here
in this PR.*
### Summary
Here is the description copied from the JIRA:
> **Issue**: The APIs for DecisionTree and ensembles (RandomForests and
GradientBoostedTrees) have been experimental for a long time. The API has
become very convoluted because trees and ensembles have many, many variants,
some of which we have added incrementally without a long-term design.
> **Proposal**: This JIRA is for discussing changes required to finalize
the APIs. After we discuss, I will make a PR to update the APIs and make them
non-Experimental. This will require making many breaking changes; see the
design doc for details.
> **[Design doc](https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4)**: This outlines current issues and the proposed API.
Overall code layout:
* The old API in mllib.tree.* will remain the same.
* The new API will reside in mllib.classification.* and mllib.regression.*
### What to review
Currently, the only items to review are the 2 examples and their
demonstration of setting parameters and calling run():
* NewDT.scala
* JavaNewDT.java
Note that these 2 examples compile and run. You may find it interesting to see
how parameters are implemented in traits; this is based on the ml.param.Param
implementation and the ongoing dev list discussion on enum-like types. Please
do NOT bother to comment on details or messy things yet.
Current questions:
* Should we rename "FeatureSubsetStrategy"?
* Is "DecisionTreeClassificationModel" too long a name?
* How should users select parameter options? See examples, and also see
doc images I will paste below:
  * RandomForestClassifier.featureSubsetStrategies.Auto/All/etc.
  * DecisionTreeClassifier.Entropy/Gini
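
To make the last question concrete, here is a minimal sketch of one way an enum-like option could be exposed on the estimator itself, so that `DecisionTreeClassifier.Entropy` resolves without a separate import. It is not the code in this PR: `ParamSketch`, `Impurity`, `HasImpurity`, and the setter and getter names are assumptions made for illustration, and the class has no Spark dependency.

```scala
// Hypothetical sketch only; none of these definitions are taken from the PR.
object ParamSketch {

  // Enum-like type: a sealed trait with one case object per option.
  sealed trait Impurity
  object Impurity {
    case object Gini extends Impurity
    case object Entropy extends Impurity
  }

  // The parameter lives in a trait so several estimators can mix it in.
  trait HasImpurity {
    private var impurity: Impurity = Impurity.Gini

    // Returning this.type keeps the concrete estimator type when chaining setters.
    def setImpurity(value: Impurity): this.type = { impurity = value; this }

    def getImpurity: Impurity = impurity
  }

  class DecisionTreeClassifier extends HasImpurity

  // The companion re-exports the options, so users write
  // DecisionTreeClassifier.Entropy rather than importing Impurity.
  object DecisionTreeClassifier {
    val Gini: Impurity = Impurity.Gini
    val Entropy: Impurity = Impurity.Entropy
  }
}
```

Under these assumptions, a call would read `new ParamSketch.DecisionTreeClassifier().setImpurity(ParamSketch.DecisionTreeClassifier.Entropy)`. The trade-off is that each estimator's companion has to repeat the re-exports.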
### Notes
#### FeatureSubsetStrategy options
Another option for FeatureSubsetStrategy would be something like:
```
import org.apache.spark.mllib.tree.FeatureSubsetStrategy
...
rf.setFeatureSubsetStrategy(FeatureSubsetStrategy.Fraction(0.2))
```
The main issue is that FeatureSubsetStrategy is shared across
classification and regression, so we must either (a) import it from a shared
subpackage like o.a.s.mllib.tree or (b) make it available in both
o.a.s.mllib.classification and o.a.s.mllib.regression.
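
The two choices can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not the proposed implementation: nested objects stand in for the real packages, only the `All` and `Fraction` options are modeled, and the `featuresPerSplit` helper and its rounding rule are hypothetical.

```scala
// Hypothetical sketch; nested objects stand in for the o.a.s.mllib.* packages.
object FeatureSubsetSketch {

  // Option (a): one definition in a shared location such as mllib.tree.
  object tree {
    sealed trait FeatureSubsetStrategy
    object FeatureSubsetStrategy {
      case object All extends FeatureSubsetStrategy
      final case class Fraction(value: Double) extends FeatureSubsetStrategy {
        require(value > 0.0 && value <= 1.0, s"Fraction must be in (0, 1], got $value")
      }
    }
  }

  // Option (b): re-export the same type and companion from both API packages,
  // so each package offers the name without a second definition.
  object classification {
    type FeatureSubsetStrategy = tree.FeatureSubsetStrategy
    val FeatureSubsetStrategy = tree.FeatureSubsetStrategy
  }
  object regression {
    type FeatureSubsetStrategy = tree.FeatureSubsetStrategy
    val FeatureSubsetStrategy = tree.FeatureSubsetStrategy
  }

  // Assumed helper: how many features to consider at each split.
  def featuresPerSplit(strategy: tree.FeatureSubsetStrategy, numFeatures: Int): Int =
    strategy match {
      case tree.FeatureSubsetStrategy.All => numFeatures
      case tree.FeatureSubsetStrategy.Fraction(f) =>
        math.max(1, math.ceil(f * numFeatures).toInt)
    }
}
```

Because the aliases in option (b) point at a single underlying type, a value built through `classification.FeatureSubsetStrategy.Fraction(0.5)` is interchangeable with one built through `regression.FeatureSubsetStrategy.Fraction(0.5)`.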
CC: @mengxr @manishamde @codedeft @chouqin
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark dt-api
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5009.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5009
----
commit a335cb7f77d1a1cd2c40ee1cc7fdfff36afe7081
Author: Joseph K. Bradley <[email protected]>
Date: 2015-03-13T07:30:46Z
Initial sketch of new tree API, especially parameters
----