[
https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16101452#comment-16101452
]
Nick Pentreath commented on SPARK-21535:
----------------------------------------
Parallel CV is in progress: https://github.com/apache/spark/pull/16774. How
will this work with that?
> Reduce memory requirement for CrossValidator and TrainValidationSplit
> ----------------------------------------------------------------------
>
> Key: SPARK-21535
> URL: https://issues.apache.org/jira/browse/SPARK-21535
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.2.0
> Reporter: yuhao yang
>
> CrossValidator and TrainValidationSplit both use
> {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where
> epm is Array[ParamMap].
> Even though the training process is sequential, current implementation
> consumes extra driver memory for holding the trained models, which is not
> necessary and often leads to memory exception for both CrossValidator and
> TrainValidationSplit. My proposal is to optimize the training implementation,
> thus that used model can be collected by GC, and avoid the unnecessary OOM
> exceptions.
> E.g. when grid search space is 12, old implementation needs to hold all 12
> trained models in the driver memory at the same time, while the new
> implementation only needs to hold 1 trained model at a time, and previous
> model can be cleared by GC.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]