[ https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320038#comment-14320038 ]
Peter Rudenko commented on SPARK-4766:
--------------------------------------
This is an important feature that could yield a substantial speedup. Let me
explain why.
I have a pipeline with 4 transformers and 1 estimator (LogisticRegression),
using 3-fold cross-validation and 3 hyperparameter values in the grid search:
{code}
val paramGrid = new ParamGridBuilder()
.addGrid(model.regParam, Array(0.1, 0.01, 0.001))
.build()
crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(3)
{code}
The transformers have no parameters in the grid search. Right now, for every
combination of hyperparameter and cross-validation fold, the pipeline
re-transforms the data (with the same transformers), creating a new RDD with a
new ID but the same contents, so I cannot cache it. What I came up with is to
use 2 pipelines:
# Transformer pipeline - transforms the whole dataset once
# Model pipeline - contains just the model
I modified [Pipeline|https://issues.apache.org/jira/browse/SPARK-5796] and the
LogisticRegression class (I commented out instances.unpersist(), because the
same instances are reused for each hyperparameter). This significantly reduced
the running time of the LogisticRegression pipeline.
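A rough sketch of the two-pipeline workaround, for illustration only. It is written against the later DataFrame-based spark.ml API rather than the 1.2 API discussed here, and the names TwoPipelines, fitWithPrecomputedFeatures, raw and featureStages are hypothetical. It assumes the feature stages produce the default "features" and "label" columns, and that they are pure transformers (a stage that learns from data would see the validation rows when fitted on the whole dataset).

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.DataFrame

object TwoPipelines {
  // `featureStages` stands in for the 4 transformers mentioned above (hypothetical).
  def fitWithPrecomputedFeatures(
      raw: DataFrame,
      featureStages: Array[PipelineStage]): CrossValidatorModel = {
    // Pipeline 1: transformers only, applied once to the whole dataset and cached.
    val featurePipeline = new Pipeline().setStages(featureStages)
    val features = featurePipeline.fit(raw).transform(raw).cache()

    // Pipeline 2: just the model.
    val model = new LogisticRegression()
    val modelPipeline = new Pipeline().setStages(Array[PipelineStage](model))

    // Same grid as in the comment: 3 values of regParam.
    val paramGrid = new ParamGridBuilder()
      .addGrid(model.regParam, Array(0.1, 0.01, 0.001))
      .build()

    val crossval = new CrossValidator()
      .setEstimator(modelPipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    // Cross-validation now only splits the cached, already-transformed data.
    crossval.fit(features)
  }
}
```

Unlike the patch described above, this sketch leaves Pipeline and LogisticRegression untouched, so any caching done inside LogisticRegression itself is still repeated per fit.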
It would be nice to do this inside Pipeline itself: if there are no
grid-search parameters for the Transformer stages, construct the data once and
pass the same data to the estimator for each hyperparameter. For 3 folds it
would then read and cache the data 3 times ((1 to 3).combinations(2)),
independent of the number of estimator hyperparameters (right now it does so 9
times: 3 fold combinations * 3 model parameters).
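The arithmetic above can be checked with plain Scala collections, no Spark needed. This is only a counting sketch; the object and value names (TransformCount, trainingSets, and so on) are made up for illustration, and the "without reuse" count reflects the behaviour as described in this comment.

```scala
object TransformCount {
  val numFolds = 3
  val regParams = List(0.1, 0.01, 0.001)

  // With 3 folds, each training set is 2 of the 3 splits:
  // List(List(1, 2), List(1, 3), List(2, 3))
  val trainingSets: List[List[Int]] =
    (1 to numFolds).combinations(numFolds - 1).map(_.toList).toList

  // Behaviour described above: the transformers rerun for every
  // (training set, hyperparameter) pair.
  val withoutReuse: Int =
    (for (fold <- trainingSets; p <- regParams) yield (fold, p)).size

  // Proposed behaviour: transform and cache once per training set, then fit
  // every hyperparameter on that cached data.
  val withReuse: Int = trainingSets.size

  def main(args: Array[String]): Unit = {
    println(s"without reuse: $withoutReuse") // prints "without reuse: 9"
    println(s"with reuse: $withReuse")       // prints "with reuse: 3"
  }
}
```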
> ML Estimator Params should subclass Transformer Params
> ------------------------------------------------------
>
> Key: SPARK-4766
> URL: https://issues.apache.org/jira/browse/SPARK-4766
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 1.2.0
> Reporter: Joseph K. Bradley
>
> Currently, in spark.ml, both Transformers and Estimators extend the same
> Params classes. There should be one Params class for the Transformer and one
> for the Estimator, where the Estimator params class extends the Transformer
> one.
> E.g., it is weird to be able to do:
> {code}
> val model: LogisticRegressionModel = ...
> model.getMaxIter()
> {code}
> (This is the only case where this happens currently, but it is worth setting
> a precedent.)