Github user petro-rudenko commented on the pull request:

    https://github.com/apache/spark/pull/4593#issuecomment-75550855
  
    @dbtsai, @joshdevins here's an issue I have. I'm using the new ML pipeline
    with hyperparameter grid search. Because the folds don't depend on the
    hyperparameters, I've reimplemented LogisticRegression slightly so that it
    doesn't unpersist the data:
    ```scala
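    // Imports assume the Spark 1.2-era package layout (SchemaRDD-based ML API).
    import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
    import org.apache.spark.ml.param.{ParamMap, Params}
    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SchemaRDD}
    import org.apache.spark.storage.StorageLevel

    class CustomLogisticRegression extends LogisticRegression {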
      var oldInstances: RDD[LabeledPoint] = null
      
      override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
        println(s"Fitting dataset ${dataset.id} with ParamMap $paramMap.")
        transformSchema(dataset.schema, paramMap, logging = true)
        import dataset.sqlContext._
        val map = this.paramMap ++ paramMap
        val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
          .map {
            case Row(label: Double, features: Vector) =>
              LabeledPoint(label, features)
          }
    
        // Guard the cache swap so that parallel grid search threads
        // don't race on oldInstances.
        this.synchronized({
          if (oldInstances == null || oldInstances.id != instances.id) {
            if (oldInstances != null) {
              oldInstances.unpersist()
            }
            oldInstances = instances
            instances.setName(
              s"Instances for LR with ParamMap $paramMap and RDD ${dataset.id}")
            instances.persist(StorageLevel.MEMORY_AND_DISK)
          }
        })
    
        val lr = (new LogisticRegressionWithLBFGS)
          .setValidateData(false)
    
        lr.optimizer
          .setRegParam(map(regParam))
          .setNumIterations(map(maxIter))
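        // Train once and reuse the result; calling lr.run twice would fit
        // the model a second time.
        val lrOldModel = lr.run(instances)
        val lrm = new LogisticRegressionModel(this, map, lrOldModel.weights)
        // Deliberately no instances.unpersist() here, so the fold stays
        // cached for the next hyperparameter setting.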
        // copy model params
        Params.inheritValues(map, this, lrm)
        lrm
      }
    }
    ```
    
    Then, with 3 folds in cross-validation and 3 hyperparameter values for
    LogisticRegression, I got something like this:
    
    ```
    Fitting dataset 11 with ParamMap {
        CustomLogisticRegression-f35ae4d3-regParam: 0.5
    }
    Fitting dataset 11 with ParamMap {
        CustomLogisticRegression-f35ae4d3-regParam: 0.1
    }
    Fitting dataset 11 with ParamMap {
        CustomLogisticRegression-f35ae4d3-regParam: 0.01
    }
    
    Fitting dataset 12 with ParamMap {
        CustomLogisticRegression-f35ae4d3-regParam: 0.5
    }
    Fitting dataset 12 with ParamMap {
        CustomLogisticRegression-f35ae4d3-regParam: 0.1
    }
    Fitting dataset 12 with ParamMap {
        CustomLogisticRegression-f35ae4d3-regParam: 0.01
    }
    ```
    
    So persistence at the model level is needed to cache the folds across the
    hyperparameter grid search, while persistence at the GLM level is needed
    to speed up the StandardScaler transformation etc. I don't know yet how to
    do this efficiently without double caching.
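
    One possible direction, as a minimal sketch only (not tested against this
    PR): have the lower level persist its input only when the caller has not
    already done so, by checking `RDD.getStorageLevel`. The names `PersistOnce`
    and `withPersistence` are hypothetical helpers, not part of Spark. This
    covers the case where the same RDD would be cached twice; it does not by
    itself solve caching a separately scaled copy of the data.

    ```scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Hypothetical helper, not a Spark API.
    object PersistOnce {
      /** Runs `body` on `rdd`, persisting it only if it is not persisted yet. */
      def withPersistence[T, R](
          rdd: RDD[T],
          level: StorageLevel = StorageLevel.MEMORY_AND_DISK)(body: RDD[T] => R): R = {
        // StorageLevel.NONE means nobody upstream has asked to cache this RDD.
        val handlePersistence = rdd.getStorageLevel == StorageLevel.NONE
        if (handlePersistence) rdd.persist(level)
        try body(rdd)
        finally {
          // Only release what this call persisted; a caller's cache is left alone.
          if (handlePersistence) rdd.unpersist()
        }
      }
    }
    ```

    With such a helper, a fold cached at the model level would pass through
    untouched, while an uncached input would be persisted just for the
    duration of the training call.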

