[ 
https://issues.apache.org/jira/browse/SYSTEMML-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Dusenberry updated SYSTEMML-1159:
--------------------------------------
    Description: 
Training a parameterized machine learning model (such as a large neural net in 
deep learning) requires learning a set of ideal model parameters from the data, 
as well as determining appropriate hyperparameters (or "settings") for the 
training process itself.  In the latter case, the hyperparameters (e.g. 
learning rate, regularization strength, dropout percentage, model architecture) 
cannot be learned from the data, and instead are determined via a search across 
a space for each hyperparameter.  For large numbers of hyperparameters (such as 
in deep learning models), the current literature recommends performing staged, 
randomized searches over the space to produce distributions of performance, 
narrowing the space after each stage \[1].  Thus, for efficient hyperparameter 
optimization, it is desirable to train several models in parallel, with each 
model trained over the full dataset.  For deep learning models, mini-batch 
training is currently the state of the art, and thus separate models with 
different hyperparameters could conceivably be trained on separate nodes of a 
cluster.
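
As a rough illustration of one stage of such a randomized search (the specific 
hyperparameters, ranges, and candidate count below are purely hypothetical and 
not part of this proposal), a DML sketch might look like:

{code}
# Illustrative only: one stage of a randomized hyperparameter search.
# Draw N candidate configurations log-uniformly from the current ranges;
# after evaluating them, the ranges would be narrowed for the next stage.
N = 32
lr_exp  = rand(rows=N, cols=1, min=-6, max=-1)   # log10 of learning rate
reg_exp = rand(rows=N, cols=1, min=-5, max=-1)   # log10 of regularization strength
candidates = cbind(10 ^ lr_exp, 10 ^ reg_exp)    # N x 2 matrix of sampled configs
print("sampled " + nrow(candidates) + " candidate configurations")
{code}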

To enable the training of deep learning models, SystemML needs a solution that 
supports this scenario on the Spark backend.  Specifically, if the user has a 
{{train}} function that takes a set of hyperparameters and trains a model with 
a mini-batch approach (and thus only uses single-node instructions within the 
function), the user should be able to wrap this function with, for example, a 
remote {{parfor}} construct that samples hyperparameters and calls the 
{{train}} function on each machine in parallel.

To be clear, each model would need access to the entire dataset, and each model 
would be trained independently.
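
A minimal sketch of the desired usage follows; the {{train}} body, data sizes, 
hyperparameter ranges, and model count are placeholders used only to illustrate 
the pattern of wrapping a mini-batch {{train}} function in a {{parfor}} over 
sampled hyperparameters:

{code}
# Hypothetical sketch of the desired usage; the train() body, data sizes,
# and hyperparameter ranges below are placeholders, not part of this issue.
train = function (matrix[double] X, matrix[double] y, double lr, double reg)
    return (matrix[double] W) {
  # mini-batch training loop using only single-node operations ...
  W = matrix(0, rows=ncol(X), cols=1)  # placeholder result
}

X = rand(rows=10000, cols=100)  # stand-in for the full training set
y = rand(rows=10000, cols=1)

numModels = 16
losses = matrix(0, rows=numModels, cols=1)
# Each iteration samples its own hyperparameters and trains one model;
# ideally this parfor would run with a remote (Spark) execution mode so
# that each model is trained on a separate node against the full dataset.
parfor (i in 1:numModels) {
  lr  = 10 ^ as.scalar(rand(rows=1, cols=1, min=-6, max=-1))
  reg = 10 ^ as.scalar(rand(rows=1, cols=1, min=-5, max=-1))
  W = train(X, y, lr, reg)
  losses[i, 1] = sum((X %*% W - y) ^ 2)  # placeholder evaluation metric
}
{code}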

\[1]: http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf

> Enable Remote Hyperparameter Tuning
> -----------------------------------
>
>                 Key: SYSTEMML-1159
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1159
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Mike Dusenberry
>            Priority: Blocker
>



