[ https://issues.apache.org/jira/browse/SPARK-21535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
yuhao yang resolved SPARK-21535. -------------------------------- Resolution: Not A Problem The new implementation will load the evaluation dataset when training model and may not always present a better performance. Please refer to the discussion in the PR. > Reduce memory requirement for CrossValidator and TrainValidationSplit > ---------------------------------------------------------------------- > > Key: SPARK-21535 > URL: https://issues.apache.org/jira/browse/SPARK-21535 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 2.2.0 > Reporter: yuhao yang > > CrossValidator and TrainValidationSplit both use > {code}models = est.fit(trainingDataset, epm) {code} to fit the models, where > epm is Array[ParamMap]. > Even though the training process is sequential, current implementation > consumes extra driver memory for holding the trained models, which is not > necessary and often leads to memory exception for both CrossValidator and > TrainValidationSplit. My proposal is to optimize the training implementation, > thus that used model can be collected by GC, and avoid the unnecessary OOM > exceptions. > E.g. when grid search space is 12, old implementation needs to hold all 12 > trained models in the driver memory at the same time, while the new > implementation only needs to hold 1 trained model at a time, and previous > model can be cleared by GC. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org