GitHub user petro-rudenko commented on the pull request:

https://github.com/apache/spark/pull/4593#issuecomment-75550855

@dbtsai, @joshdevins here's an issue I have. I'm using the new ML pipeline with hyperparameter grid search. Because the folds don't depend on the hyperparameters, I've reimplemented LogisticRegression slightly so that it doesn't unpersist the data:

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.param.{ParamMap, Params}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SchemaRDD}
import org.apache.spark.storage.StorageLevel

class CustomLogisticRegression extends LogisticRegression {
  // Instances persisted by the previous call to fit()
  var oldInstances: RDD[LabeledPoint] = null

  override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
    println(s"Fitting dataset ${dataset.id} with ParamMap $paramMap.")
    transformSchema(dataset.schema, paramMap, logging = true)
    import dataset.sqlContext._
    val map = this.paramMap ++ paramMap
    val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
      .map { case Row(label: Double, features: Vector) =>
        LabeledPoint(label, features)
      }

    // For parallel grid search
    this.synchronized({
      if (oldInstances == null || oldInstances.id != instances.id) {
        if (oldInstances != null) {
          oldInstances.unpersist()
        }
        oldInstances = instances
        instances.setName(s"Instances for LR with ParamMap $paramMap and RDD ${dataset.id}")
        instances.persist(StorageLevel.MEMORY_AND_DISK)
      }
    })

    val lr = (new LogisticRegressionWithLBFGS)
      .setValidateData(false)
    lr.optimizer
      .setRegParam(map(regParam))
      .setNumIterations(map(maxIter))
    // Train once and reuse the result (the original ran lr.run(instances) twice)
    val lrOldModel = lr.run(instances)
    val lrm = new LogisticRegressionModel(this, map, lrOldModel.weights)
    // instances.unpersist()
    // copy model params
    Params.inheritValues(map, this, lrm)
    lrm
  }
}
```

Then, with 3 folds in cross-validation and 3 hyperparameter values for LogisticRegression, I got something like this:

```
Fitting dataset 11 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.5 }
Fitting dataset 11 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.1 }
Fitting dataset 11 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.01 }
Fitting dataset 12 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.5 }
Fitting dataset 12 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.1 }
Fitting dataset 12 with ParamMap { CustomLogisticRegression-f35ae4d3-regParam: 0.01 }
```

So persistence at the model level is needed to cache the folds for the hyperparameter grid search, but persistence at the GLM level is needed to speed up the StandardScaler transformation etc. I don't know yet how to do this efficiently without double caching.
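To make the model-level caching idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. It keeps one cached value keyed by the input dataset's id and releases the old value only when a different dataset arrives. `SingleSlotCache`, `getOrLoad`, and the `release` callback are hypothetical names invented for this illustration and are not part of any Spark API; in a real pipeline `load` would stand in for building and persisting the instances RDD and `release` for `unpersist()`. It does not address the double-caching question.

```scala
// Hypothetical helper, not Spark API: a one-slot cache keyed by dataset id.
final class SingleSlotCache[K, V](release: V => Unit) {
  private var slot: Option[(K, V)] = None

  // Returns the cached value when the key matches; otherwise releases the
  // old value, evaluates `load`, and stores the result under the new key.
  def getOrLoad(key: K)(load: => V): V = synchronized {
    slot match {
      case Some((k, v)) if k == key => v
      case _ =>
        slot.foreach { case (_, old) => release(old) }
        val v = load
        slot = Some((key, v))
        v
    }
  }
}

object SingleSlotCacheDemo extends App {
  var loads = 0    // how many times the data was materialized
  var releases = 0 // how many times an old value was dropped

  val cache = new SingleSlotCache[Int, Vector[Double]](_ => releases += 1)

  // Two folds (dataset ids 11 and 12), three regParam values each.
  for (datasetId <- Seq(11, 12); regParam <- Seq(0.5, 0.1, 0.01)) {
    val instances = cache.getOrLoad(datasetId) {
      loads += 1
      Vector(1.0, 2.0, 3.0) // stand-in for the fold's training data
    }
    println(s"Fitting dataset $datasetId with regParam $regParam on ${instances.size} instances")
  }

  println(s"loads = $loads, releases = $releases")
}
```

In this toy run each fold is materialized once and reused for all three `regParam` values, so the counters end at two loads and one release.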