[
https://issues.apache.org/jira/browse/SYSTEMML-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Glenn Weidner updated SYSTEMML-1563:
------------------------------------
Fix Version/s: (was: SystemML 1.0)
SystemML 0.15
> Add a distributed synchronous SGD MNIST LeNet example
> -----------------------------------------------------
>
> Key: SYSTEMML-1563
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1563
> Project: SystemML
> Issue Type: New Feature
> Reporter: Mike Dusenberry
> Assignee: Mike Dusenberry
> Fix For: SystemML 0.15
>
>
> This aims to add a *distributed synchronous SGD* MNIST LeNet example. In
> distributed synchronous SGD, multiple mini-batches are run forward & backward
> simultaneously, and the gradients are aggregated together by addition before
> the model parameters are updated. This is mathematically equivalent to
> simply using a large mini-batch size, i.e. {{new_mini_batch_size =
> mini_batch_size * number_of_parallel_mini_batches}}. The benefit is that
> distributed synchronous SGD can make use of multiple devices, i.e. multiple
> GPUs or multiple CPU machines, and thus can speed up training time. More
> specifically, using an effectively larger mini-batch size can yield a more
> stable gradient in expectation, and a larger number of epochs can be run in
> the same amount of time, both of which lead to faster convergence.
> Alternatives include various forms of distributed _asynchronous_ SGD, such as
> Downpour, Hogwild, etc. However, a recent paper \[1] from Google Brain /
> OpenAI has found evidence supporting the claim that distributed synchronous
> SGD can lead to faster convergence, particularly if it is extended with the
> notion of "backup workers" as described in the paper.
> We will first aim for distributed synchronous SGD with no backup workers, and
> then extend this to include backup workers. The MNIST LeNet model will
> simply serve as an example, and this same approach can be extended to more
> recent models, such as ResNets.
> \[1]: https://arxiv.org/abs/1604.00981
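The gradient-aggregation equivalence described in the issue can be sketched as follows. This is a hypothetical pure-Python illustration (not SystemML DML, and not the actual LeNet example): `grad`, `sync_sgd_step`, and the tiny linear model are all made up for exposition. Each "worker" computes the gradient of a squared-error loss on its own shard of the mini-batch; the shard gradients are aggregated by addition before a single parameter update, which matches one SGD step on the full (larger) mini-batch exactly.

```python
# Hypothetical sketch of distributed synchronous SGD on a 1-D linear
# model y = w * x with squared-error loss. All names are illustrative.

def grad(w, shard):
    # Gradient of 0.5 * (w*x - y)^2, summed over the shard's examples.
    return sum((w * x - y) * x for x, y in shard)

def sync_sgd_step(w, shards, lr):
    # Each shard's forward/backward pass could run on a separate device;
    # the gradients are then aggregated together by addition before the
    # model parameter is updated (synchronous SGD).
    total_grad = sum(grad(w, shard) for shard in shards)
    return w - lr * total_grad

def large_batch_step(w, batch, lr):
    # One SGD step over the full batch, i.e. the effectively larger
    # mini-batch of size mini_batch_size * number_of_parallel_mini_batches.
    return w - lr * grad(w, batch)

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [batch[:2], batch[2:]]  # two "parallel workers", two examples each

w0, lr = 0.0, 0.01
w_sync = sync_sgd_step(w0, shards, lr)
w_big = large_batch_step(w0, batch, lr)
assert abs(w_sync - w_big) < 1e-12  # mathematically equivalent updates
```

Note that the asynchronous variants mentioned above (Downpour, Hogwild) drop exactly this aggregation barrier: workers update the parameters with possibly stale gradients, trading the strict equivalence shown here for reduced synchronization cost.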
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)