[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062339#comment-14062339 ]
Erik Erlandson commented on SPARK-1486:
---------------------------------------
Does the development on this issue effectively subsume SPARK-1457 and/or SPARK-1856?
> Support multi-model training in MLlib
> -------------------------------------
>
> Key: SPARK-1486
> URL: https://issues.apache.org/jira/browse/SPARK-1486
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Critical
> Fix For: 1.1.0
>
>
> It is rare in practice to train just one model with a given set of
> parameters. Usually, multiple models are trained with different sets of
> parameters, and the best is then selected based on its performance on the
> validation set. MLlib should provide native support for multi-model
> training/scoring. This requires decoupling concepts like problem,
> formulation, algorithm, parameter set, and model, which MLlib currently
> lacks. MLI implements similar concepts, which we can borrow. There are
> several approaches to multi-model training:
> 0) Keep one copy of the data, and train models one after another (or maybe in
> parallel, depending on the scheduler).
> 1) Keep one copy of the data, and train multiple models at the same time
> (similar to `runs` in KMeans).
> 2) Make multiple copies of the data (still distributed across the
> cluster), and use more cores to distribute the work.
> 3) Collect the data, make the entire dataset available on workers, and train
> one or more models on each worker.
> Users should be able to choose which execution mode they want to use. Note
> that 3) could cover many use cases in practice when the training data is not
> huge, e.g., <1GB.
> This task will be divided into sub-tasks and this JIRA is created to discuss
> the design and track the overall progress.
--
This message was sent by Atlassian JIRA
(v6.2#6252)