[ https://issues.apache.org/jira/browse/SYSTEMML-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020432#comment-16020432 ]
Mike Dusenberry commented on SYSTEMML-1563:
-------------------------------------------
Merged in [commit 6ad5509|https://github.com/apache/incubator-systemml/commit/6ad5509bd23d45bfc5ea65e23dd956caacfa7c76].
> Add a distributed synchronous SGD MNIST LeNet example
> -----------------------------------------------------
>
> Key: SYSTEMML-1563
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1563
> Project: SystemML
> Issue Type: Sub-task
> Reporter: Mike Dusenberry
> Assignee: Mike Dusenberry
> Fix For: SystemML 1.0
>
>
> This aims to add a *distributed synchronous SGD* MNIST LeNet example. In
> distributed synchronous SGD, multiple mini-batches are run forward & backward
> simultaneously, and the gradients are aggregated together by addition before
> the model parameters are updated. This is mathematically equivalent to
> simply using a larger mini-batch size, i.e. {{new_mini_batch_size =
> mini_batch_size * number_of_parallel_mini_batches}}. The benefit is that
> distributed synchronous SGD can make use of multiple devices, i.e. multiple
> GPUs or multiple CPU machines, and thus can reduce training time. More
> specifically, using an effectively larger mini-batch size can yield a more
> stable gradient in expectation, and a larger number of epochs can be run in
> the same amount of time, both of which lead to faster convergence.
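>
> As a rough illustration of the aggregation step, here is a minimal NumPy
> sketch (not the DML from the merged commit; {{num_workers}},
> {{batch_size}}, {{lr}}, and the shape of {{W}} are all hypothetical):
> {code:python}
> import numpy as np
>
> # Hypothetical setup: 4 parallel workers, each computing a gradient for
> # the same parameter matrix W from its own mini-batch.
> num_workers = 4
> batch_size = 64  # per-worker mini-batch size
> lr = 0.01
> W = np.random.randn(128, 10)
>
> # Placeholder per-worker gradients; in practice each is the sum of
> # per-example gradients from that worker's forward/backward pass.
> grads = [np.random.randn(*W.shape) for _ in range(num_workers)]
>
> # Synchronous step: aggregate the gradients by addition, then apply a
> # single update. Normalizing by the total example count makes this
> # exactly one SGD step over a mini-batch that is num_workers times larger.
> agg_grad = np.sum(grads, axis=0) / (num_workers * batch_size)
> W -= lr * agg_grad
> {code}
>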
> Alternatives include various forms of distributed _asynchronous_ SGD, such as
> Downpour, Hogwild, etc. However, a recent paper \[1] from Google Brain /
> OpenAI presents evidence that distributed synchronous SGD can lead to
> faster convergence, particularly if it is extended with the notion of
> "backup workers" as described in the paper.
>
> We will first aim for distributed synchronous SGD with no backup workers, and
> then extend this to include backup workers. The MNIST LeNet model will
> simply serve as an example, and this same approach can be extended to more
> recent models, such as ResNets.
> \[1]: https://arxiv.org/abs/1604.00981