[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062339#comment-14062339 ]
Erik Erlandson commented on SPARK-1486:
---------------------------------------
Does the development on this issue effectively subsume SPARK-1457 and/or SPARK-1856?
> Support multi-model training in MLlib
> -------------------------------------
>
> Key: SPARK-1486
> URL: https://issues.apache.org/jira/browse/SPARK-1486
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Critical
> Fix For: 1.1.0
>
>
> It is rare in practice to train just one model with a given set of
> parameters. Usually, multiple models are trained with different sets of
> parameters, and the best is then selected based on its performance on the
> validation set. MLlib should provide native support for multi-model
> training/scoring. This requires decoupling concepts like problem,
> formulation, algorithm, parameter set, and model, which MLlib currently
> lacks. MLI implements similar concepts, which we can borrow. There are
> several approaches to multi-model training:
> 0) Keep one copy of the data, and train models one after another (or maybe in
> parallel, depending on the scheduler).
> 1) Keep one copy of the data, and train multiple models at the same time
> (similar to `runs` in KMeans).
> 2) Make multiple copies of the data (still distributed across the
> cluster), and use more cores to distribute the work.
> 3) Collect the data, make the entire dataset available on workers, and train
> one or more models on each worker.
> Users should be able to choose which execution mode they want to use. Note
> that 3) could cover many use cases in practice when the training data is not
> huge, e.g., <1GB.
> This task will be divided into sub-tasks and this JIRA is created to discuss
> the design and track the overall progress.
--
This message was sent by Atlassian JIRA
(v6.2#6252)