Mike Dusenberry created SYSTEMML-1159:
-----------------------------------------
Summary: Enable Remote Hyperparameter Tuning
Key: SYSTEMML-1159
URL: https://issues.apache.org/jira/browse/SYSTEMML-1159
Project: SystemML
Issue Type: Improvement
Reporter: Mike Dusenberry
Priority: Blocker
Training a parameterized machine learning model (such as a large neural net in
deep learning) requires learning a set of ideal model parameters from the data,
as well as determining appropriate hyperparameters (or "settings") for the
training process itself. The hyperparameters (e.g. learning rate,
regularization strength, dropout percentage, and model architecture) cannot be
learned from the data, and are instead determined via a search across a space
of candidate values for each hyperparameter. For large numbers of hyperparameters
(such as in deep learning models), the current literature points to performing
staged, randomized grid searches over the space to produce distributions of
performance, narrowing the space after each search \[1]. Thus, for efficient
hyperparameter optimization, it is desirable to train several models in
parallel, with each model trained over the full dataset. For deep learning
models, a mini-batch training approach is currently state-of-the-art, and thus
separate models with different hyperparameters could, conceivably, be easily
trained on each of the nodes in a cluster.
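As a rough illustration of the staged search described above, here is a minimal Python sketch. It is not SystemML code: the function names, the log-uniform sampling, the "keep the top k trials" narrowing rule, and the toy loss that stands in for actually training a model are all assumptions made for the example.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def toy_validation_loss(lr, reg):
    # Hypothetical stand-in for "train a model and return its validation loss".
    # Its minimum sits at lr = 1e-2, reg = 1e-4.
    return (math.log10(lr) + 2) ** 2 + (math.log10(reg) + 4) ** 2


def staged_random_search(space, stages=3, trials=20, keep=5, seed=0):
    """Run several rounds of random search, narrowing the space each round."""
    rng = random.Random(seed)
    best = None
    for _ in range(stages):
        results = []
        for _ in range(trials):
            params = {name: sample_log_uniform(rng, lo, hi)
                      for name, (lo, hi) in space.items()}
            results.append((toy_validation_loss(**params), params))
        results.sort(key=lambda r: r[0])
        if best is None or results[0][0] < best[0]:
            best = results[0]
        # Narrow each range to the span covered by the best trials of this stage.
        top = [params for _, params in results[:keep]]
        space = {name: (min(p[name] for p in top), max(p[name] for p in top))
                 for name in space}
    return best, space


best, final_space = staged_random_search({"lr": (1e-5, 1.0), "reg": (1e-8, 1e-1)})
print("best loss:", best[0], "params:", best[1])
print("narrowed space:", final_space)
```

The trials within a stage are independent of one another, which is what makes each stage a natural candidate for parallel execution.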
To support the training of deep learning models, SystemML needs a way to
enable this scenario with the Spark backend.
Specifically, if the user has a {{train}} function that takes a set of
hyperparameters and trains a model with a mini-batch approach (and thus is only
making use of single-node instructions within the function), the user should be
able to wrap this function with, for example, a remote {{parfor}} construct
that samples hyperparameters and calls the {{train}} function on each machine
in parallel. To be clear, each model would need access to the entire dataset.
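The intended usage pattern can be sketched in plain Python as follows. This is only an analogy, not SystemML or DML code: a local thread pool stands in for the remote {{parfor}}, the {{train}} function here is a hypothetical mini-batch SGD on a toy 1-D linear regression, and the sampling ranges are arbitrary. The point it shows is that every task receives the full dataset and differs only in its hyperparameters.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def train(data, lr, reg, epochs=20, batch_size=8, seed=0):
    """Mini-batch SGD for y ~ w*x + b; returns (final loss, w, b)."""
    rng = random.Random(seed)
    data = list(data)  # private copy so that shuffling stays local to this task
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch) + reg * w
            grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    loss = sum((w * x + b - y) ** 2 for x, y in data) / len(data)
    return loss, w, b


def tune(data, num_models=8, max_workers=4, seed=0):
    """Sample hyperparameters and train one model per sample in parallel."""
    rng = random.Random(seed)
    configs = [{"lr": 10 ** rng.uniform(-3, -1), "reg": 10 ** rng.uniform(-6, -2)}
               for _ in range(num_models)]
    # The pool plays the role of the remote parfor: one train() call per config,
    # each given the entire dataset.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(train, data, **cfg) for cfg in configs]
        results = [(f.result()[0], cfg) for f, cfg in zip(futures, configs)]
    return min(results, key=lambda r: r[0]), results


# Toy dataset generated from y = 3x + 1.
data = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(-20, 21)]
best, results = tune(data)
print("best loss:", best[0], "with", best[1])
```

In the scenario requested here, each call to {{train}} would instead run on a separate node of the cluster using only single-node instructions.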
\[1]: http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf