[
https://issues.apache.org/jira/browse/SPARK-19979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-19979:
------------------------------------
Assignee: Apache Spark
> [MLLIB] Multiple Estimators/Pipelines In CrossValidator
> -------------------------------------------------------
>
> Key: SPARK-19979
> URL: https://issues.apache.org/jira/browse/SPARK-19979
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.1.0
> Reporter: David Leifker
> Assignee: Apache Spark
>
> Update CrossValidator and TrainValidationSplit to accept multiple pipelines
> and parameter grids, so that different algorithms can be tested together and
> tuning combinations can be controlled more precisely. The change maintains a
> backwards-compatible API and still reads legacy serialized objects.
> The same could be done using an external iterative approach: build different
> pipelines, pass each one to its own CrossValidator, take the best model from
> each of those CrossValidators, and finally pick the best among those. This
> is the initial approach I explored. It resulted in a lot of boilerplate code
> that felt like it shouldn't need to exist if the API simply allowed for
> arrays of estimators and their parameters.
> A couple of advantages of this implementation come from keeping the
> functional interface of the CrossValidator.
> 1. The caching of the folds is better utilized. An external iterative
> approach creates a new set of k folds for each CrossValidator fit, and the
> folds are discarded after each CrossValidator run. In this implementation, a
> single set of k folds is created and cached for all of the pipelines.
> 2. A potential advantage of this implementation is future parallelization
> of the pipelines within the CrossValidator. It is of course possible to
> handle the parallelization outside of the CrossValidator here too. However,
> I believe there is already work in progress to parallelize the grid
> parameters, and that could be extended to multiple pipelines.
> Both of these behind-the-scenes optimizations are possible because the
> CrossValidator is given the data and the complete set of
> pipelines/estimators to evaluate up front, which lets it abstract away the
> implementation.
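
As a rough illustration of the first advantage described above (a single set
of k folds shared by every pipeline), here is a minimal stdlib-only Python
sketch. It does not use Spark and is not the API proposed in this issue: the
names make_folds, param_grid, fit_mean, fit_line and cross_validate are
hypothetical, and the two toy "estimators" merely stand in for pipelines.

```python
from itertools import product
from statistics import mean


def make_folds(rows, k):
    """Split rows into k folds once; every candidate reuses the same folds."""
    return [rows[i::k] for i in range(k)]


def param_grid(grid):
    """Expand {"name": [values, ...]} into a list of parameter dicts."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


# Toy estimators (hypothetical): fit(train, params) returns a predict function.
def fit_mean(train, params):
    # Predict a constant: the training-label mean scaled by "shrink".
    m = mean(y for _, y in train) * params["shrink"]
    return lambda x: m


def fit_line(train, params):
    # Least-squares slope through the origin with a ridge penalty.
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train) + params["ridge"]
    slope = sxy / sxx
    return lambda x: slope * x


def mse(model, rows):
    return mean((model(x) - y) ** 2 for x, y in rows)


def cross_validate(candidates, rows, k=3):
    """Evaluate every (estimator, params) pair on one shared set of folds."""
    folds = make_folds(rows, k)  # built once, not once per estimator
    results = []
    for name, fit, grid in candidates:
        for params in param_grid(grid):
            scores = []
            for i in range(k):
                train = [r for j, f in enumerate(folds) if j != i for r in f]
                scores.append(mse(fit(train, params), folds[i]))
            results.append((mean(scores), name, params))
    # Lower mean squared error is better.
    return min(results, key=lambda r: r[0])


rows = [(float(x), 2.0 * x) for x in range(1, 13)]
candidates = [
    ("mean", fit_mean, {"shrink": [1.0, 0.9]}),
    ("line", fit_line, {"ridge": [0.0, 1.0]}),
]
best_score, best_name, best_params = cross_validate(candidates, rows)
print(best_name, best_params, best_score)
```

In the external iterative approach, make_folds would effectively run once per
estimator. Here each (estimator, parameter grid) pair is scored against the
same folds, and the single best candidate is selected in one pass.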