Github user petro-rudenko commented on the pull request:
https://github.com/apache/spark/pull/4593#issuecomment-75550855
@dbtsai, @joshdevins here's an issue I have. I'm using the new ml pipeline
with hyperparameter grid search. Because the folds don't depend on the
hyperparameters, I've reimplemented LogisticRegression a bit so that it
doesn't unpersist the data:
```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.param.{ParamMap, Params}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SchemaRDD}
import org.apache.spark.storage.StorageLevel

class CustomLogisticRegression extends LogisticRegression {

  // Handle on the previously persisted instances, so repeated fit() calls on
  // the same dataset (one per ParamMap in the grid) reuse the cached RDD.
  var oldInstances: RDD[LabeledPoint] = null

  override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
    println(s"Fitting dataset ${dataset.id} with ParamMap $paramMap.")
    transformSchema(dataset.schema, paramMap, logging = true)
    import dataset.sqlContext._
    val map = this.paramMap ++ paramMap
    val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
      .map { case Row(label: Double, features: Vector) =>
        LabeledPoint(label, features)
      }
    // For parallel grid search: persist the new instances and unpersist the
    // previous ones, instead of unpersisting after every fit().
    this.synchronized {
      if (oldInstances == null || oldInstances.id != instances.id) {
        if (oldInstances != null) {
          oldInstances.unpersist()
        }
        oldInstances = instances
        instances.setName(s"Instances for LR with ParamMap $paramMap and RDD ${dataset.id}")
        instances.persist(StorageLevel.MEMORY_AND_DISK)
      }
    }
    val lr = (new LogisticRegressionWithLBFGS)
      .setValidateData(false)
    lr.optimizer
      .setRegParam(map(regParam))
      .setNumIterations(map(maxIter))
    // Train once and build the pipeline model from the resulting weights.
    val lrOldModel = lr.run(instances)
    val lrm = new LogisticRegressionModel(this, map, lrOldModel.weights)
    // instances.unpersist()
    // copy model params
    Params.inheritValues(map, this, lrm)
    lrm
  }
}
```
Then for 3 folds in cross-validation and 3 hyperparameter values for
LogisticRegression, I get something like this:
```
Fitting dataset 11 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.5
}
Fitting dataset 11 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.1
}
Fitting dataset 11 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.01
}
Fitting dataset 12 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.5
}
Fitting dataset 12 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.1
}
Fitting dataset 12 with ParamMap {
CustomLogisticRegression-f35ae4d3-regParam: 0.01
}
```
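For reference, the grid search producing these calls is wired up roughly like this (a minimal sketch: `runGridSearch` and `trainingData` are placeholders and the evaluator is just an example; only the regParam values and the fold count come from the run above):
```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SchemaRDD

// `trainingData` is assumed to be a SchemaRDD with the label and features
// columns expected by the estimator.
def runGridSearch(trainingData: SchemaRDD) = {
  val lr = new CustomLogisticRegression

  // regParam values and fold count as in the log above.
  val paramGrid = new ParamGridBuilder()
    .addGrid(lr.regParam, Array(0.5, 0.1, 0.01))
    .build()

  val cv = new CrossValidator()
    .setEstimator(lr)
    .setEvaluator(new BinaryClassificationEvaluator)
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)

  // Calls fit() on the estimator once per ParamMap for each fold's
  // training split, hence the repeated "Fitting dataset ..." lines.
  cv.fit(trainingData)
}
```
Since CrossValidator calls `fit()` once per ParamMap on each fold's training split, the same dataset id shows up once per hyperparameter setting in the log, and caching only inside a single `fit()` call would re-extract and re-persist the same fold three times.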
So persistence at the model level is needed to cache the folds for the
hyperparameter grid search, but persistence at the GLM level is needed to
speed up the StandardScaler transformation etc. I don't yet know how to do
this efficiently without double caching.