[jira] [Created] (SPARK-19979) [MLLIB] Multiple Estimators/Pipelines In CrossValidator

David Leifker (JIRA) Thu, 16 Mar 2017 07:43:05 -0700

David Leifker created SPARK-19979:
-------------------------------------

             Summary: [MLLIB] Multiple Estimators/Pipelines In CrossValidator
                 Key: SPARK-19979
                 URL: https://issues.apache.org/jira/browse/SPARK-19979
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 2.1.0
            Reporter: David Leifker



Update CrossValidator and TrainValidationSplit to accept multiple pipelines, 
each with its own parameter grid, so that different algorithms can be compared 
and tuning combinations controlled more precisely. The change maintains a 
backwards-compatible API and reads legacy serialized objects.

The same result can be achieved with an external iterative approach: build the 
different pipelines, run each through its own CrossValidator, take the best 
model from each of those CrossValidators, and finally pick the best of those. 
This is the initial approach I explored. It resulted in a lot of boilerplate 
code that felt like it shouldn't need to exist if the API simply allowed for 
arrays of estimators and their parameters.
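
As a rough illustration of that boilerplate, here is a sketch in plain Python 
rather than Spark. Everything in it is a hypothetical stand-in, not part of any 
Spark API: cross_validate plays the role of a single-estimator CrossValidator, 
and fit_mean / fit_median are toy "estimators" that predict a constant. Lower 
error is treated as better.

```python
import random
from statistics import mean, median

def make_folds(data, k, seed):
    """Shuffle the data and split it into k roughly equal folds."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    return [rows[i::k] for i in range(k)]

def cross_validate(fit, param_grid, data, k=3, seed=0):
    """Hypothetical single-estimator cross-validation.

    Returns (best_params, best_error) for one estimator and its grid.
    """
    folds = make_folds(data, k, seed)  # rebuilt on every call, then discarded
    best = None
    for params in param_grid:
        errors = []
        for i in range(k):
            train = [x for j, f in enumerate(folds) if j != i for x in f]
            prediction = fit(train, **params)
            errors.append(mean([abs(prediction - x) for x in folds[i]]))
        error = mean(errors)
        if best is None or error < best[1]:
            best = (params, error)
    return best

# Toy estimators (assumptions for illustration): each returns a constant prediction.
def fit_mean(train, shrink=1.0):
    return shrink * mean(train)

def fit_median(train, offset=0.0):
    return median(train) + offset

candidates = {
    "mean": (fit_mean, [{"shrink": 1.0}, {"shrink": 0.9}]),
    "median": (fit_median, [{"offset": 0.0}, {"offset": 0.5}]),
}
data = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 10.0, 2.2, 2.8]

# The boilerplate: one cross-validation run per candidate...
results = {}
for name, (fit, grid) in candidates.items():
    results[name] = cross_validate(fit, grid, data)

# ...followed by a second, hand-written selection step over the winners.
best_name = min(results, key=lambda n: results[n][1])
best_params, best_error = results[best_name]
```

The outer loop and the final selection step are the parts that would disappear 
if the validator itself accepted several estimators and their grids.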

This implementation has a couple of advantages worth considering, both of which 
come from keeping the functional interface of the CrossValidator.

1. The caching of the folds is better utilized. An external iterative approach 
creates a new set of k folds for each CrossValidator fit and discards them 
after each CrossValidator run. In this implementation, a single set of k folds 
is created and cached for all of the pipelines.

2. A potential advantage of this implementation is future parallelization of 
the pipelines within the CrossValidator. It is of course possible to handle 
the parallelization outside of the CrossValidator here too. However, I believe 
there is already work in progress to parallelize the grid parameters, and that 
could be extended to multiple pipelines.

Both of those behind-the-scenes optimizations are possible because the 
CrossValidator is given the data and the complete set of pipelines/estimators 
to evaluate up front, which allows the implementation to be abstracted away.
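
To make the two points above concrete, here is a minimal sketch in plain 
Python, not the proposed Spark implementation. The names cross_validate_many, 
score, fit_mean and fit_median are hypothetical, and a thread pool stands in 
for whatever parallelization Spark might eventually use. The folds are built 
once and shared by every (estimator, params) pair, and because all candidates 
are known up front they can be scored concurrently.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, median

def make_folds(data, k, seed=0):
    """Shuffle the data and split it into k roughly equal folds."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    return [rows[i::k] for i in range(k)]

def score(fit, params, folds):
    """Average held-out mean absolute error for one (estimator, params) pair."""
    errors = []
    for i, held_out in enumerate(folds):
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        prediction = fit(train, **params)
        errors.append(mean([abs(prediction - x) for x in held_out]))
    return mean(errors)

def cross_validate_many(candidates, data, k=3, workers=2):
    """candidates is a list of (name, fit, param_grid) tuples.

    Returns (best_name, best_params, best_error); lower error is better.
    """
    folds = make_folds(data, k)  # built once, reused by every candidate
    jobs = [(name, fit, params)
            for name, fit, grid in candidates
            for params in grid]
    # All jobs are known up front, so they can be scored concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        errors = list(pool.map(lambda job: score(job[1], job[2], folds), jobs))
    best_index = min(range(len(jobs)), key=errors.__getitem__)
    best_name, _, best_params = jobs[best_index]
    return best_name, best_params, errors[best_index]

# Toy estimators (assumptions for illustration): each returns a constant prediction.
def fit_mean(train, shrink=1.0):
    return shrink * mean(train)

def fit_median(train, offset=0.0):
    return median(train) + offset

candidates = [
    ("mean", fit_mean, [{"shrink": 1.0}, {"shrink": 0.9}]),
    ("median", fit_median, [{"offset": 0.0}, {"offset": 0.5}]),
]
data = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 10.0, 2.2, 2.8]

best_name, best_params, best_error = cross_validate_many(candidates, data)
```

The caller passes everything in one call and gets a single winner back, so fold 
reuse and scheduling stay internal details of the validator.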



