[ https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176773#comment-16176773 ]
Bryan Cutler commented on SPARK-19357:
--------------------------------------
[~josephkb] I think trying to push the parallelism down to the estimators
might end up making things difficult. Each model-specific optimization would
have to implement some kind of parallelization, and for pipelines it could get
really messy. As [~WeichenXu123] pointed out, there could be memory problems
too.
It should be possible to keep the current parallelism and still allow for
model-specific optimizations. For example, say we are doing cross validation
with a param grid of {{regParam = (0.1, 0.3)}} and {{maxIter = (5, 10)}}, and
the cross validator knows that maxIter is optimized for the model being
evaluated (e.g. via a new method in Estimator that returns such params). It
would then be straightforward for the cross validator to remove maxIter from
the param grid that gets parallelized over and use it to build 2 arrays of
paramMaps: {{((regParam=0.1, maxIter=5), (regParam=0.1, maxIter=10))}} and
{{((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10))}}. It could then fit
these 2 arrays in parallel with calls to {{def fit(dataset: Dataset\[_\],
paramMaps: Array\[ParamMap\]): Seq\[M\]}}, as sketched below.
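To make that concrete, here is a minimal sketch of the grid-splitting step.
The {{groupParamMaps}} helper and the {{getOptimizedParams}} hook are
hypothetical names for the proposal, not existing API:
{code:scala}
import org.apache.spark.ml.param.{Param, ParamMap}

// Hypothetical sketch: split the full grid into groups that differ only in
// the model-optimized params (e.g. maxIter), so each group can be handed to
// fit(dataset, paramMaps) in one call and the groups fit in parallel.
def groupParamMaps(
    epm: Array[ParamMap],
    optimized: Array[Param[_]]): Array[Array[ParamMap]] = {
  val optimizedSet: Set[Param[_]] = optimized.toSet
  epm.groupBy { pm =>
    // Key each ParamMap by its non-optimized settings; maps that share a key
    // differ only in optimized params and belong to the same group.
    pm.toSeq.filterNot(pair => optimizedSet.contains(pair.param)).toSet
  }.values.toArray
}

// The cross validator would then fit each group with the existing
// multi-model entry point, e.g.:
//   val groups = groupParamMaps(epm, est.getOptimizedParams())
//   val models = groups.par.map(group => est.fit(trainingDataset, group))
{code}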
Hopefully that makes sense. In short, it would require some simple changes to
CrossValidator plus something like a new method in {{Estimator}} that returns
the list of model-specific optimized params, e.g. {{def getOptimizedParams():
Array\[Param\[_\]\] = Array.empty}}, which estimators could override as
required.
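A sketch of that hook, assuming it lives in a small trait mixed into
{{Estimator}} (names illustrative, not final):
{code:scala}
import org.apache.spark.ml.param.{Param, Params}

// Hypothetical sketch of the proposed hook. The Array.empty default keeps
// current behavior for estimators with no model-specific optimizations.
trait HasOptimizedParams extends Params {
  def getOptimizedParams(): Array[Param[_]] = Array.empty
}

// An estimator that can, say, snapshot models at several maxIter values in a
// single training run would override it:
//   override def getOptimizedParams(): Array[Param[_]] = Array(maxIter)
{code}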
> Parallel Model Evaluation for ML Tuning: Scala
> ----------------------------------------------
>
> Key: SPARK-19357
> URL: https://issues.apache.org/jira/browse/SPARK-19357
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Bryan Cutler
> Assignee: Bryan Cutler
> Fix For: 2.3.0
>
> Attachments: parallelism-verification-test.pdf
>
>
> This is a first step of the parent task, Optimizations for ML Pipeline
> Tuning, to perform model evaluation in parallel. A simple approach is to
> naively evaluate models in parallel, with a parameter to control the level
> of parallelism (a usage sketch follows below). There are some concerns with
> this:
> * excessive caching of datasets
> * what to set as the default level of parallelism: 1 evaluates all models
> in serial, as is done currently, while higher values could lead to
> excessive caching
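For reference, this landed in 2.3.0 as a {{parallelism}} param on the tuning
classes. A minimal usage sketch, assuming the {{setParallelism}} setter and an
illustrative {{training}} DataFrame:
{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Build the same example grid as in the comment above.
val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.3))
  .addGrid(lr.maxIter, Array(5, 10))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setParallelism(2)  // evaluate up to 2 models at a time; 1 = serial default

// `training` is assumed to be a DataFrame of labeled feature vectors.
val cvModel = cv.fit(training)
{code}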