GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19904
[SPARK-22707][ML] Optimize CrossValidator fitting memory occupation by models
## What changes were proposed in this pull request?
Through testing I found that CrossValidator still has a memory issue: while
fitting, it occupies `O(n*sizeof(model))` memory to hold models, whereas a
well-optimized implementation should need only `O(parallelism*sizeof(model))`.
This is because each future in `modelFutures` keeps a reference to its model
object after the future completes (the model can still be fetched with
`future.value.get.get`), and both `Future.sequence` and the `modelFutures`
array hold references to every model future. So all model objects stay
referenced until `fit` returns, and memory usage remains
`O(n*sizeof(model))`.
I fix this by merging `modelFuture` and `foldMetricFuture` into a single
future, and by using `wait/notify` to unpersist the training dataset promptly.
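
As a rough illustration of the merged-future idea (not the code in this patch), the sketch below fits and evaluates each model inside one future, so the future's completed value is only a `Double` metric and the model becomes unreachable as soon as the future body returns. `Model`, `fitModel`, and `evaluate` are hypothetical stand-ins for the real estimator and evaluator calls, and the `wait/notify` unpersist step is omitted.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object MergedFutureSketch {
  // Hypothetical stand-in for a fitted model; the array simulates its memory footprint.
  final case class Model(paramIndex: Int, weights: Array[Double])

  // Hypothetical stand-in for fitting an estimator with one param map.
  def fitModel(paramIndex: Int): Model =
    Model(paramIndex, Array.fill(1000)(paramIndex.toDouble))

  // Hypothetical stand-in for evaluating a model on the validation set.
  def evaluate(model: Model): Double =
    model.weights.sum / model.weights.length

  def main(args: Array[String]): Unit = {
    val parallelism = 2
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    val numParamMaps = 8
    // Fit and evaluate in the same future body. The completed future stores
    // only the metric, so at most `parallelism` models are live at once.
    val metricFutures: Seq[Future[Double]] = (0 until numParamMaps).map { i =>
      Future {
        val model = fitModel(i)
        evaluate(model)
      }
    }

    // Future.sequence now retains references to Doubles, not to models.
    val metrics = Await.result(Future.sequence(metricFutures), Duration.Inf)
    println(metrics.mkString(", "))
    pool.shutdown()
  }
}
```

With separate model and metric futures, each completed model future would instead keep its `Model` reachable until every future in the sequence had been collected.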
## How was this patch tested?
N/A
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/WeichenXu123/spark fix_cross_validator_memory_issue
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19904.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19904
----
commit 7725fd8a86dddba6c61c7d053dfa510a114bebb8
Author: WeichenXu <[email protected]>
Date: 2017-12-05T11:45:42Z
init pr
----