[ 
https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320038#comment-14320038
 ] 

Peter Rudenko commented on SPARK-4766:
--------------------------------------

This is a very important feature that could give a substantial speedup. Let me 
explain why. I have a pipeline with 4 transformers and 1 estimator 
(LogisticRegression), with 3 folds for cross-validation and 3 hyperparameter 
values in the grid search:

{code}
val paramGrid = new ParamGridBuilder()
      .addGrid(model.regParam, Array(0.1, 0.01, 0.001))
      .build()

crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(3)
{code}

The transformers don't have any parameters in the grid search. Right now, for 
every combination of hyperparameter value and cross-validation fold, the 
pipeline transforms the data again (with the same transformers), creating a new 
RDD with a new ID but the same data. So I cannot cache it. What I came up with 
is to use 2 pipelines: 
# Transformer pipeline - transforms the whole data once 
# Model pipeline with just the model in it.
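
A minimal sketch of this two-pipeline workaround, assuming a Spark version 
where spark.ml operates on DataFrames. The helper fitWithSharedFeatures, its 
featureStages argument and the "label"/"features" column names are hypothetical 
and not taken from my actual code:

{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.DataFrame

// Hypothetical helper. Assumes `raw` has a "label" column and that
// `featureStages` (the parameter-free transformers) produce a "features" column.
def fitWithSharedFeatures(
    raw: DataFrame,
    featureStages: Array[PipelineStage]): CrossValidatorModel = {
  // 1. Transformer pipeline: run the feature stages once over the whole data
  //    and cache the result so every fold and hyperparameter value reuses it.
  val features = new Pipeline()
    .setStages(featureStages)
    .fit(raw)
    .transform(raw)
    .cache()

  // 2. Model pipeline: only the estimator takes part in the grid search.
  val model = new LogisticRegression()
  val paramGrid = new ParamGridBuilder()
    .addGrid(model.regParam, Array(0.1, 0.01, 0.001))
    .build()

  val crossval = new CrossValidator()
    .setEstimator(new Pipeline().setStages(Array[PipelineStage](model)))
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)

  crossval.fit(features)
}
{code}

This is only safe when the feature stages are pure transformers. A stage that 
is fitted on the data would see the held-out folds if it runs before the split.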

I modified [Pipeline|https://issues.apache.org/jira/browse/SPARK-5796] and the 
LogisticRegression class (I commented out instances.unpersist(), because the 
same instances are reused for each hyperparameter value). This significantly 
reduced the running time of the LogisticRegression pipeline.

But it would be nice to do this in Pipeline itself: if there are no grid-search 
parameters for the Transformer stages, construct the data once per fold and 
pass the same data to the estimator for each hyperparameter value. For 3 folds 
it would then read and cache the data 3 times ((1 to 3).combinations(2)) and 
would not depend on the number of hyperparameter values for the estimator 
(right now it does so 9 times: 3 fold combinations * 3 model parameters).
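
The arithmetic behind these counts can be checked with a few lines of plain 
Scala. The variable names are only illustrative:

{code}
// Each cross-validation round trains on 2 of the 3 folds.
val numFolds = 3
val trainingSets = (1 to numFolds).combinations(numFolds - 1).toList
// Three training sets: folds (1, 2), (1, 3) and (2, 3).

// Number of regParam values in the grid.
val numParamSettings = 3

// Today the transformers run once per (training set, hyperparameter) pair.
val currentTransformRuns = trainingSets.size * numParamSettings

// With shared data they would run once per training set only.
val proposedTransformRuns = trainingSets.size
{code}

Here currentTransformRuns is 9 and proposedTransformRuns is 3, and only the 
first grows when more values are added to the grid.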


> ML Estimator Params should subclass Transformer Params
> ------------------------------------------------------
>
>                 Key: SPARK-4766
>                 URL: https://issues.apache.org/jira/browse/SPARK-4766
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> Currently, in spark.ml, both Transformers and Estimators extend the same 
> Params classes.  There should be one Params class for the Transformer and one 
> for the Estimator, where the Estimator params class extends the Transformer 
> one.
> E.g., it is weird to be able to do:
> {code}
> val model: LogisticRegressionModel = ...
> model.getMaxIter()
> {code}
> (This is the only case where this happens currently, but it is worth setting 
> a precedent.)



