GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/5530
[SPARK-6113] [ml] Stabilize DecisionTree API
This is a PR for cleaning up and finalizing the DecisionTree API. PRs for
ensembles will follow once this is merged.
### Goal
Here is the description copied from the JIRA (for both trees and ensembles):
> **Issue**: The APIs for DecisionTree and ensembles (RandomForests and
GradientBoostedTrees) have been experimental for a long time. The API has
become very convoluted because trees and ensembles have many, many variants,
some of which we have added incrementally without a long-term design.
> **Proposal**: This JIRA is for discussing changes required to finalize
the APIs. After we discuss, I will make a PR to update the APIs and make them
non-Experimental. This will require making many breaking changes; see the
design doc for details.
> **[Design doc](https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4)**:
This outlines current issues and the proposed API.
Overall code layout:
* The old API in mllib.tree.* will remain the same.
* The new API will reside in ml.classification.* and ml.regression.*
### Summary of changes
Old API
* Exactly the same, except I made 1 method in Loss private (but that is not
a breaking change since that method was introduced after the Spark 1.3 release).
New APIs
* Under Pipeline API
* The new API preserves functionality, except:
  * New API does NOT store prob (probability of label in classification).
    I want to have it store the full vector of probabilities but feel that should
    be in a later PR.
* Use abstractions for parameters, estimators, and models to avoid code
  duplication
* Limit parameters to relevant algorithms
* For enum-like types, only expose Strings
  * We can make these pluggable later on by adding new parameters. That is
    a far-future item.
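To illustrate the String-only approach for enum-like types, here is a minimal standalone sketch. It is not code from this PR: `ImpurityNames` and `SketchTreeParams` are hypothetical names, and the option values `"gini"` and `"entropy"` are assumed for the example.

```scala
// Hypothetical sketch, not the Spark implementation: an enum-like setting
// exposed to callers as a plain String and validated against a fixed set.
object ImpurityNames {
  // Assumed option names, chosen only for illustration.
  val supported: Set[String] = Set("gini", "entropy")

  // Normalize case, then reject unknown names early with a readable message.
  def validate(name: String): String = {
    val normalized = name.toLowerCase(java.util.Locale.ROOT)
    val options = supported.toSeq.sorted.mkString(", ")
    require(supported.contains(normalized),
      s"Unsupported impurity '$name'; supported: $options")
    normalized
  }
}

// Hypothetical parameter holder with a String-typed setter and getter.
final class SketchTreeParams {
  private var impurity: String = "gini"

  def setImpurity(value: String): this.type = {
    impurity = ImpurityNames.validate(value)
    this
  }

  def getImpurity: String = impurity
}
```

Callers only ever see Strings, so the internal representation could later become pluggable without changing the setter's signature.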
Test suites
* I organized DecisionTreeSuite, but I made absolutely no changes to the
tests themselves.
* The test suites for the new API only test (a) similarity with the results
of the old API and (b) elements of the new API.
* After code is moved to this new API, we should move the tests from the
old suites which test the internals.
### Details
#### Changed names
Parameters
* useNodeIdCache -> cacheNodeIds
#### Other changes
* Split: Changed categories to set instead of list
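As a rough illustration of why a set fits here, consider a simplified categorical split. This is a sketch under assumptions, not the PR's `Split` class: `SketchCategoricalSplit`, `leftCategories`, and `goesLeft` are hypothetical names.

```scala
// Hypothetical, simplified categorical split; not the class from this PR.
// Storing categories in a Set makes membership tests direct and removes
// any question about ordering or duplicates.
final case class SketchCategoricalSplit(featureIndex: Int, leftCategories: Set[Double]) {
  // Assumption for this sketch: a row goes left when its category is in the set.
  def goesLeft(features: Array[Double]): Boolean =
    leftCategories.contains(features(featureIndex))
}
```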
#### Non-decision tree changes
* AttributeGroup
  * Added parentheses to toMetadata, toStructField methods (These were
    removed in a previous PR, but I ran into 1 issue with the Scala compiler not
    being able to disambiguate between a toMetadata method with no parentheses and
    a toMetadata method which takes 1 argument.)
* Attributes
  * Renamed: toMetadata -> toMetadataImpl
  * Added toMetadata methods which return ML metadata (keyed with "ML_ATTR")
  * NominalAttribute: Added getNumValues method which examines both
    numValues and values.
* Params.inheritValues: Checks whether the parent param really belongs to
  the child (to allow Estimator-Model pairs with different sets of parameters)
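The `Params.inheritValues` check can be pictured with a simplified stand-in. This is a sketch only: `SketchParams` and its members are hypothetical and much simpler than Spark's `Params`; it shows just the idea of copying a parent's value only when the child declares that parameter.

```scala
// Hypothetical stand-in for a parameter holder; not Spark's Params trait.
class SketchParams(val declared: Set[String]) {
  private val values = scala.collection.mutable.Map.empty[String, Any]

  // Set a value, but only for parameters this instance declares.
  def set(name: String, value: Any): this.type = {
    require(declared.contains(name), s"Unknown param: $name")
    values(name) = value
    this
  }

  def get(name: String): Option[Any] = values.get(name)

  // Copy values from the parent, skipping params the child does not declare.
  def inheritValues(parent: SketchParams): this.type = {
    parent.values.foreach { case (name, value) =>
      if (declared.contains(name)) values(name) = value
    }
    this
  }
}
```

With this shape, an estimator that declares a training-only parameter can hand its values to a model that declares fewer parameters, and the extra one is silently skipped rather than causing an error.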
### Questions for reviewers
* Is "DecisionTreeClassificationModel" too long a name?
* Is this OK in the docs?
```
class DecisionTreeRegressor extends
TreeRegressor[DecisionTreeRegressionModel] with
DecisionTreeParams[DecisionTreeRegressor] with
TreeRegressorParams[DecisionTreeRegressor]
```
### Future
We should open up the abstractions at some point. E.g., it would be useful
to be able to set tree-related parameters in 1 place and then pass those to
multiple tree-based algorithms.
Follow-up JIRAs will be (in this order):
* Tree ensembles
* Deprecate old tree code
* Move DecisionTree implementation code to new API.
* Move tests from the old suites which test the internals.
* Update programming guide
* Python API
* Change RandomForest* to always use bootstrapping, even when numTrees = 1
* Provide the probability of the predicted label for classification. After
we move code to the new API and update it to maintain probabilities for all
labels, then we can add the probabilities to the new API.
CC: @mengxr @manishamde @codedeft @chouqin @MechCoder
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark dt-api-dt
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5530.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5530
----
commit c72c1a01387bfd0e672f71f385f9a31e9d5d2c1e
Author: Joseph K. Bradley <[email protected]>
Date: 2015-04-01T03:55:22Z
Copied changes for common items, plus DecisionTreeClassifier from original
PR
commit 2532c9a9189cc01592a6f9d7d49d10055f918d7c
Author: Joseph K. Bradley <[email protected]>
Date: 2015-04-02T03:08:08Z
partial move to spark.ml API, not done yet
commit f9fbb605f503a91f4a998cd32fee11510dfd341c
Author: Joseph K. Bradley <[email protected]>
Date: 2015-04-14T05:07:15Z
Done with DecisionTreeClassifier, but no save/load yet. Need to add
example as well
commit 0bdc486e426abaef4c2c3619280450601a458ab8
Author: Joseph K. Bradley <[email protected]>
Date: 2015-04-14T06:28:13Z
fixed issues after param PR was merged
commit 119f407231f52d9339dd8b821b6fe652e6b695b8
Author: Joseph K. Bradley <[email protected]>
Date: 2015-04-14T22:11:20Z
added DecisionTreeClassifier example
commit e11673f8994314add5d2a749c1ab808f126d2bca
Author: Joseph K. Bradley <[email protected]>
Date: 2015-04-14T23:53:03Z
Added DecisionTreeRegressor, test suites, and example
commit 7ef63ed593cbcaa87b0078b548c6c7738499d7b3
Author: Joseph K. Bradley <[email protected]>
Date: 2015-04-14T23:53:19Z
Added DecisionTreeRegressor, test suites, and example (for real this time)
commit f8fbd24877c522138f8d16d2c1855c498b83ba0c
Author: Joseph K. Bradley <[email protected]>
Date: 2015-04-15T16:23:58Z
imported reorg of DecisionTreeSuite from old PR. small cleanups
----