GitHub user yinxusen opened a pull request:
https://github.com/apache/spark/pull/12604
[SPARK-14706] Python ML persistence integration test
## What changes were proposed in this pull request?
This patch adds integration tests for Python ML persistence.
- Add persistence tests for
CrossValidator(TrainValidationSplit(LogisticRegression)) and
TrainValidationSplit(CrossValidator(LogisticRegression)).
- Extend `_compare_pipelines` to also check CrossValidator,
CrossValidatorModel, TrainValidationSplit, TrainValidationSplitModel,
OneVsRest, and OneVsRestModel.
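To illustrate the kind of check that nested meta-algorithms need, here is a minimal plain-Python sketch of a recursive comparison. `ToyStage`, `compare_stages`, and `make_nested` are hypothetical names used only for illustration; they are not PySpark classes and this is not the actual `_compare_pipelines` code.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ToyStage:
    """Hypothetical stand-in for an ML stage; not a PySpark class."""
    uid: str
    params: Dict[str, Any] = field(default_factory=dict)
    estimator: Optional["ToyStage"] = None  # nested stage wrapped by a tuner


def compare_stages(a: ToyStage, b: ToyStage) -> bool:
    """Return True if both stages, and everything nested inside them, match."""
    if type(a) is not type(b) or a.uid != b.uid or a.params != b.params:
        return False
    if a.estimator is None or b.estimator is None:
        # Equal only when neither side wraps a nested estimator.
        return a.estimator is b.estimator
    return compare_stages(a.estimator, b.estimator)


def make_nested() -> ToyStage:
    """Build a toy cv(tvs(lr)) nesting, mirroring the tested combinations."""
    lr = ToyStage("lr", {"maxIter": 10})
    tvs = ToyStage("tvs", {"trainRatio": 0.75}, estimator=lr)
    return ToyStage("cv", {"numFolds": 3}, estimator=tvs)


original = make_nested()
reloaded = make_nested()
same = compare_stages(original, reloaded)

# A difference two levels down must still be detected.
changed = make_nested()
changed.estimator.estimator.params["maxIter"] = 20
different = compare_stages(original, changed)
```

The point of the recursion is that a shallow comparison of the outermost tuner would miss a mismatch in the innermost estimator.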
**Bugs found and fixed in this PR:**
- OneVsRest, CrossValidator and TrainValidationSplit need
`_transfer_param_map_to_java` and `_transfer_param_map_from_java`; otherwise
they cannot be used as estimators in tuning.
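- Calling `getThresholds()` on a fresh `LogisticRegression`:
```python
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression()
lr.getThresholds()
```
produces `keyNotFoundError` because `thresholds` is neither set nor in
`_defaultParamMap`, which caused the previous JavaParams parameter equality
check to fail.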
- `trainRatio` in `TrainValidationSplit` should have a float type converter.
- Persistence fails for a tuning estimator whose `estimatorParamMaps` contain
the `classifier` param of `OneVsRest`. For example:
```python
ovr = OneVsRest()
epms = [{ovr.classifier: xxxx}, {ovr.classifier: xxx}]
cv = CrossValidator(estimator=ovr, estimatorParamMaps=epms, ...)
cv.load()
```
fails because the classifier cannot be serialized via JSON.
This last issue is not trivial, so I left it unresolved in this PR.
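
As a rough plain-Python analogy for the JSON problem (not PySpark's actual persistence code), a param map holding only numbers serializes cleanly, while one holding an estimator object does not. `ToyClassifier` and `try_dump` are hypothetical names used only for this sketch.

```python
import json


class ToyClassifier:
    """Hypothetical stand-in for an estimator object; not a PySpark class."""


def try_dump(param_map):
    """Return the JSON text, or None if the map is not JSON-serializable."""
    try:
        return json.dumps(param_map)
    except TypeError:
        # json.dumps raises TypeError for objects it has no encoding for.
        return None


# Plain numeric param values serialize without trouble.
plain = try_dump({"regParam": 0.1, "maxIter": 10})

# A param whose value is itself an estimator object cannot be encoded.
nested = try_dump({"classifier": ToyClassifier()})
```

Handling this case would presumably require saving each nested estimator separately and storing only a reference to it in the JSON metadata, which is why it is left out of this PR.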
## How was this patch tested?
The patch is tested with Python unit tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yinxusen/spark SPARK-14706
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12604.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12604
----
commit 52046d496a9e97cbb948d67ba6f0e78923b30732
Author: Xusen Yin <[email protected]>
Date: 2016-04-19T01:52:58Z
add tests for meta-algorithms persistence
commit b843673f0ae33c6f347896837e47c9eb37260f19
Author: Xusen Yin <[email protected]>
Date: 2016-04-20T18:48:01Z
fix none with sqlContext other than sc.parallelize.toDF
commit f5c14d5d9c83eb72e04a360a9941d81b74ca8f3d
Author: yinxusen <[email protected]>
Date: 2016-04-21T18:18:08Z
add seed in transfer to/from java
commit b62c6ab4e1c8a1135292cabdfa3003d1a65e0965
Author: yinxusen <[email protected]>
Date: 2016-04-21T18:35:34Z
add CrossValidatorParams and TrainValidationSplit for save/load consistency
commit 73835dd8cac2ced02a9f251f50cbc9457bfe6c41
Author: yinxusen <[email protected]>
Date: 2016-04-21T18:40:18Z
add transfer param map for TrainValidateSplit
commit 60cfe38c6b8e34c87da3be9767f850cfffe3a55e
Author: yinxusen <[email protected]>
Date: 2016-04-21T20:46:42Z
add transfer param map for OneVsRest/Model
commit 842e6064b3d66a38fb618bafc88bebe4c1a4f51e
Author: yinxusen <[email protected]>
Date: 2016-04-22T05:58:17Z
fix cv wraps tvs and tvs wraps cv
commit fa570c663fd07cb520ac8c05f98887e6c0cf4ad2
Author: yinxusen <[email protected]>
Date: 2016-04-22T06:14:35Z
fix transfer param map for ovr
commit 40d48baaa13c7a014116d4a1845c84adb024b22c
Author: yinxusen <[email protected]>
Date: 2016-04-22T06:18:18Z
merge with master
commit 622e5647a271e68854d75aafa10972a89585df56
Author: yinxusen <[email protected]>
Date: 2016-04-22T06:35:04Z
fix style
----