[
https://issues.apache.org/jira/browse/SYSTEMML-1760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mike Dusenberry reassigned SYSTEMML-1760:
-----------------------------------------
Assignee: Fei Hu (was: Mike Dusenberry)
> Improve engine robustness of distributed SGD training
> -----------------------------------------------------
>
> Key: SYSTEMML-1760
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1760
> Project: SystemML
> Issue Type: Improvement
> Components: Algorithms, Compiler, ParFor
> Reporter: Mike Dusenberry
> Assignee: Fei Hu
>
> Currently, we have a mathematical framework in place for training with
> distributed SGD in a [distributed MNIST LeNet example|
> https://github.com/apache/systemml/blob/master/scripts/nn/examples/mnist_lenet_distrib_sgd.dml].
> This task aims to exercise this example at scale to determine (1) the
> current behavior of the engine (i.e., does the optimizer actually run it in
> a distributed fashion?), and (2) ways to improve the robustness and
> performance of the engine for this
> scenario. The distributed SGD framework from this example has already been
> ported into Caffe2DML, and thus improvements made for this task will directly
> benefit our efforts towards distributed training of Caffe models (and Keras
> in the future).
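
For context, the general pattern the referenced example follows is synchronous data-parallel SGD: each worker computes a gradient over its shard of a mini-batch, the gradients are averaged, and a single update is applied. The sketch below is an illustration of that pattern only, not SystemML or Caffe2DML code; the 1-D least-squares model and all function names are hypothetical.

```python
# Synchronous data-parallel SGD, illustrated on a toy 1-D least-squares fit.
# All names here are hypothetical; this is not the DML/Caffe2DML implementation.
import random

def shard_gradient(w, shard):
    # Gradient of the mean squared error 0.5*(w*x - y)^2 over one data shard.
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def parallel_sgd_step(w, shards, lr):
    # Each "worker" computes its shard gradient; the driver averages them
    # (mimicking a synchronous all-reduce), then takes one SGD step.
    g = sum(shard_gradient(w, s) for s in shards) / len(shards)
    return w - lr * g

random.seed(0)
data = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(64))]
shards = [data[i::4] for i in range(4)]  # 4 equal-sized simulated workers

w = 0.0
for _ in range(200):
    w = parallel_sgd_step(w, shards, lr=0.5)
print(round(w, 3))  # converges toward the true weight 3.0
```

With equal-sized shards, the average of the shard gradients equals the full mini-batch gradient, so this synchronous scheme matches single-node SGD step for step; the engine-robustness question in this issue is whether the ParFor optimizer actually distributes the per-worker gradient computations.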
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)