GitHub user hhbyyh opened a pull request:
https://github.com/apache/spark/pull/18733
[SPARK-21535][ML] Reduce memory requirement for CrossValidator and
TrainValidationSplit
## What changes were proposed in this pull request?
CrossValidator and TrainValidationSplit both use
`models = est.fit(trainingDataset, epm)`
to fit the models, where epm is `Array[ParamMap]`.
Even though the training process is sequential, the current implementation
keeps all trained models in driver memory at the same time. This is
unnecessary and often leads to out-of-memory errors in both CrossValidator
and TrainValidationSplit. My proposal is to optimize the training
implementation so that a model that is no longer used can be collected by GC,
which avoids the unnecessary OOM exceptions.
For example, when the grid search covers 12 parameter combinations, the old
implementation needs to hold all 12 trained models in driver memory at the
same time, while the new implementation only needs to hold one trained model
at a time, and the previous model can be reclaimed by GC.
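
The difference between the two patterns can be illustrated with a minimal,
Spark-free sketch. It is not the code in this patch: `ToyModel`, `fitOne` and
`evaluate` are hypothetical stand-ins for a trained model, for fitting with a
single `ParamMap`, and for evaluating the model on the validation set.

```scala
object SingleModelSketch {
  // Hypothetical stand-in for a trained model; `weights` represents the
  // driver memory that a real model would occupy.
  final case class ToyModel(paramIndex: Int, weights: Array[Double])

  // Hypothetical stand-in for fitting the estimator with one parameter setting.
  def fitOne(paramIndex: Int): ToyModel =
    ToyModel(paramIndex, Array.fill(1000)(paramIndex.toDouble))

  // Hypothetical stand-in for evaluating a model on the validation set.
  def evaluate(model: ToyModel): Double =
    model.weights.sum / model.weights.length

  // Old pattern: all models are fitted first, so every one of them stays
  // reachable (and uncollectable) while the metrics are computed.
  def metricsHoldingAll(numParams: Int): Array[Double] = {
    val models = (0 until numParams).map(fitOne)
    models.map(evaluate).toArray
  }

  // New pattern: each model is fitted and evaluated inside the loop, so only
  // one model is reachable at a time and earlier ones can be garbage-collected.
  def metricsOneAtATime(numParams: Int): Array[Double] = {
    val metrics = new Array[Double](numParams)
    var i = 0
    while (i < numParams) {
      val model = fitOne(i)
      metrics(i) = evaluate(model)
      i += 1
    }
    metrics
  }
}
```

Both functions return the same metrics; only the number of models reachable
at once differs, which is why the existing tests can cover the change.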
## How was this patch tested?
Existing unit tests, since there is no change to the logic.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hhbyyh/spark singleModel
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18733.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18733
----
commit a7667e72d78f679b9693e22742e8a624b6348fd2
Author: Yuhao Yang <[email protected]>
Date: 2017-07-25T21:41:17Z
memory optimization
----