GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/12118
[SPARK-13784][ML] Persistence for RandomForestClassifier,
RandomForestRegressor
## What changes were proposed in this pull request?
**Main change**: Added save/load for RandomForestClassifier,
RandomForestRegressor (implementation details below)
Modified numTrees method (*deprecation*)
* Goal: Reuse the default implementations of unit tests, which assume that
Estimators and Models share the same set of Params.
* What this PR does: Moves method numTrees out of trait TreeEnsembleModel
and adds it to the GBT and RF Models. In RF Models, it is deprecated in
favor of the new method getNumTrees. In Spark 2.1, RF Models can then
include Param numTrees.
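The deprecation pattern described above can be illustrated with a minimal, language-neutral sketch. This is plain Python, not Spark code: the class `ForestModelSketch` is hypothetical, and only the member names `numTrees` and `getNumTrees` are taken from this PR. It shows the old accessor staying available, warning on use, and delegating to the new getter.

```python
import warnings


class ForestModelSketch:
    """Hypothetical stand-in for an RF model; not the Spark class."""

    def __init__(self, trees):
        self._trees = list(trees)

    def getNumTrees(self):
        # New accessor: the number of trees is derived from the fitted model.
        return len(self._trees)

    @property
    def numTrees(self):
        # Old accessor kept for compatibility; warns and delegates.
        warnings.warn(
            "numTrees is deprecated; use getNumTrees instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.getNumTrees()
```

Existing callers of the old accessor keep working while being pointed at the replacement, which leaves the old name free to be repurposed in a later release.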
Minor items
* Fixes bugs in the fromOld methods of GBTClassificationModel and
GBTRegressionModel, where they assign the wrong old UID.
**Implementation details**
* Split DecisionTreeModelReadWrite.loadTreeNodes into 2 methods in order to
reuse some code for ensembles.
* Added EnsembleModelReadWrite object with save/load implementations usable
for RFs and GBTs.
  * These store all trees' nodes in a single DataFrame, and all trees'
  metadata in a second DataFrame.
* Split trait RandomForestParams into parts in order to add more Estimator
Params to RF models
* Split DefaultParamsWriter.saveMetadata into two methods to allow
ensembles to store sub-models' metadata in a single DataFrame.
DefaultParamsReader.loadMetadata is split the same way.
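
The two-table layout for ensembles can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the Spark implementation: lists of dicts stand in for the two DataFrames, and the `Node` class, the column names (`treeID`, `nodeID`, `leftChild`, `rightChild`, `metadata`), and the helper functions are all hypothetical. The idea shown is that every node row carries its tree's ID, so one table holds all trees' nodes, while a second table holds one metadata row per tree.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical binary tree node; -1 marks a missing child when flattened."""
    prediction: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def flatten_tree(tree_id, root):
    """Pre-order walk that emits one row per node, tagged with tree_id."""
    rows = []

    def visit(node):
        node_id = len(rows)
        rows.append(None)  # reserve this node's slot before visiting children
        left_id = visit(node.left) if node.left is not None else -1
        right_id = visit(node.right) if node.right is not None else -1
        rows[node_id] = {
            "treeID": tree_id,
            "nodeID": node_id,
            "prediction": node.prediction,
            "leftChild": left_id,
            "rightChild": right_id,
        }
        return node_id

    visit(root)
    return rows


def save_ensemble(trees, tree_params):
    """Return (node_rows, metadata_rows): two flat tables for the whole ensemble."""
    node_rows, metadata_rows = [], []
    for tree_id, (root, params) in enumerate(zip(trees, tree_params)):
        node_rows.extend(flatten_tree(tree_id, root))
        metadata_rows.append(
            {"treeID": tree_id, "metadata": json.dumps(params, sort_keys=True)}
        )
    return node_rows, metadata_rows


def load_ensemble(node_rows, metadata_rows):
    """Group node rows by treeID and rebuild each tree from its root (nodeID 0)."""
    by_tree = {}
    for row in node_rows:
        by_tree.setdefault(row["treeID"], {})[row["nodeID"]] = row

    def build(rows, node_id):
        if node_id == -1:
            return None
        r = rows[node_id]
        return Node(
            r["prediction"],
            build(rows, r["leftChild"]),
            build(rows, r["rightChild"]),
        )

    trees, params = [], []
    for meta in sorted(metadata_rows, key=lambda m: m["treeID"]):
        trees.append(build(by_tree[meta["treeID"]], 0))
        params.append(json.loads(meta["metadata"]))
    return trees, params
```

Keeping the tree ID on every row means the number of tables does not grow with the number of trees, and loading reduces to a group-by on that ID followed by a per-tree rebuild.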
## How was this patch tested?
Adds standard unit tests for RF save/load
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark GayathriMurali-SPARK-13784
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12118.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12118
----
commit 03bb8880e73b6c107d9a13ab90ce7f61a8756c8f
Author: GayathriMurali <[email protected]>
Date: 2016-03-23T21:09:35Z
SPARK-13784 Model export/import for Spark ml RandomForests
commit 68b9358f128c365d573c5881b06f420276fd44ff
Author: GayathriMurali <[email protected]>
Date: 2016-03-29T02:17:41Z
SPARK-13783 Model export/import for spark.ml:RandomForests
commit 2d89b4c1ad08290a7118a49413e9ed2b4722471e
Author: Joseph K. Bradley <[email protected]>
Date: 2016-04-01T19:00:33Z
Implemented read/write for RandomForestClassifier, Regressor.
commit f2f89eb74d8dd1f20915ebe4db254aa2e529ce9d
Author: Joseph K. Bradley <[email protected]>
Date: 2016-04-01T19:54:45Z
PR cleanup
commit 623b309e1a70a33acc1e03791f0450147d37eb16
Author: Joseph K. Bradley <[email protected]>
Date: 2016-04-01T21:45:19Z
Moved numTrees Param outside of RandomForest*Model Params, and deprecated
current numTrees val
commit de7ef9d1700c6653e035c3bff273165f7b73c24a
Author: Joseph K. Bradley <[email protected]>
Date: 2016-04-01T21:58:53Z
PR cleanups
----