[ https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320038#comment-14320038 ]
Peter Rudenko commented on SPARK-4766:
--------------------------------------

This is a very important feature that could give a substantial speedup. Let me explain why.

I have a pipeline with 4 transformers and 1 estimator (LogisticRegression), using 3-fold cross-validation and 3 hyperparameter values in the grid search:

{code}
val paramGrid = new ParamGridBuilder()
  .addGrid(model.regParam, Array(0.1, 0.01, 0.001))
  .build()

crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(3)
{code}

The transformers have no parameters in the grid search. Right now, for every combination of hyperparameter value and cross-validation fold, the pipeline transforms the data again (with the same transformers), creating a new RDD with a new ID but the same contents. As a result, I cannot cache it.

What I came up with is to use 2 pipelines:
# Transformer pipeline: transforms the whole dataset once
# Model pipeline: contains just the model

I modified [Pipeline|https://issues.apache.org/jira/browse/SPARK-5796] and the LogisticRegression class (I commented out instances.unpersist(), because the same instances are reused for each hyperparameter value). This reduced the running time of the LogisticRegression pipeline significantly.

It would be nice to do this inside Pipeline itself: if the Transformer stages have no parameters in the grid, construct the data once and pass the same data to the estimator for each hyperparameter value. For 3 folds it would then read and cache the data 3 times ((1 to 3).combinations(2)), independent of the number of estimator hyperparameter values (currently it does so 9 times: 3 fold combinations * 3 model parameter values).

> ML Estimator Params should subclass Transformer Params
> ------------------------------------------------------
>
>                 Key: SPARK-4766
>                 URL: https://issues.apache.org/jira/browse/SPARK-4766
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> Currently, in spark.ml, both Transformers and Estimators extend the same Params classes.
> There should be one Params class for the Transformer and one for the Estimator, where the Estimator params class extends the Transformer one.
> E.g., it is weird to be able to do:
> {code}
> val model: LogisticRegressionModel = ...
> model.getMaxIter()
> {code}
> (This is the only case where this happens currently, but it is worth setting a precedent.)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
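
The transformation counts in Peter Rudenko's comment (9 currently, 3 with the proposed change) can be checked with a small sketch. This is plain Scala with no Spark dependency, and it only simulates the loop structure. The object and method names (TransformCount, perCombination, perFold, trainingFoldSets) are hypothetical and are not part of spark.ml. The sketch assumes the transformers run once per fit call in the current behaviour and once per fold in the proposed one.

```scala
// Plain-Scala simulation (hypothetical names, not the spark.ml API).
// Each method returns (number of transformer runs, number of estimator fits).
object TransformCount {

  // Current behaviour as described in the comment: every
  // (fold, hyperparameter) combination re-runs the transformers.
  def perCombination(numFolds: Int, regParams: Seq[Double]): (Int, Int) = {
    var transforms = 0
    var fits = 0
    for (_ <- 0 until numFolds; _ <- regParams) {
      transforms += 1 // stand-in for re-running the parameter-free transformers
      fits += 1       // stand-in for fitting the estimator
    }
    (transforms, fits)
  }

  // Proposed behaviour: transform (and cache) once per fold,
  // then reuse that data for every hyperparameter value.
  def perFold(numFolds: Int, regParams: Seq[Double]): (Int, Int) = {
    var transforms = 0
    var fits = 0
    for (_ <- 0 until numFolds) {
      transforms += 1                     // one transform per fold
      regParams.foreach { _ => fits += 1 } // fits reuse the cached data
    }
    (transforms, fits)
  }

  // Training sets for k-fold cross-validation: every choice of
  // (k - 1) folds out of k, e.g. (1 to 3).combinations(2).
  def trainingFoldSets(numFolds: Int): List[List[Int]] =
    (1 to numFolds).combinations(numFolds - 1).map(_.toList).toList

  def main(args: Array[String]): Unit = {
    val regParams = Seq(0.1, 0.01, 0.001)
    println(perCombination(3, regParams)) // (9,9)
    println(perFold(3, regParams))        // (3,9)
    println(trainingFoldSets(3))          // List(List(1, 2), List(1, 3), List(2, 3))
  }
}
```

Both variants perform the same 9 estimator fits. Only the number of transformer runs changes, which is why the saving grows with the size of the hyperparameter grid.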