[ https://issues.apache.org/jira/browse/SYSTEMML-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985694#comment-15985694 ]
Mike Dusenberry commented on SYSTEMML-1563:
-------------------------------------------

[PR 442 | https://github.com/apache/incubator-systemml/pull/442] submitted.

> Add a distributed synchronous SGD MNIST LeNet example
> -----------------------------------------------------
>
>                 Key: SYSTEMML-1563
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1563
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Mike Dusenberry
>            Assignee: Mike Dusenberry
>
> This aims to add a distributed synchronous SGD MNIST LeNet example. In
> distributed synchronous SGD, multiple mini-batches are run forward & backward
> simultaneously, and the gradients are aggregated together by addition before
> the model parameters are updated. This is mathematically equivalent to
> simply using a larger mini-batch size, i.e. {{new_mini_batch_size =
> mini_batch_size * number_of_parallel_mini_batches}}. The benefit is that
> distributed synchronous SGD can make use of multiple devices, e.g. multiple
> GPUs or multiple CPU machines, and thus can reduce training time. More
> specifically, using an effectively larger mini-batch size can yield a more
> stable gradient in expectation, and a larger number of epochs can be run in
> the same amount of time, both of which lead to faster convergence.
> Alternatives include various forms of distributed *asynchronous* SGD, such as
> Downpour, Hogwild, etc. However, a recent paper \[1] from Google Brain /
> OpenAI has found evidence supporting the claim that distributed synchronous
> SGD can lead to faster convergence, particularly if it is extended with the
> notion of "backup workers" as described in the paper.
> We will first aim for distributed synchronous SGD with no backup workers, and
> then extend this to include backup workers. The MNIST LeNet model will
> simply serve as an example, and this same approach can be extended to more
> recent models, such as ResNets.
> \[1]: https://arxiv.org/abs/1604.00981
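
The equivalence claimed in the description (adding the gradients of several parallel mini-batches gives the same update as one larger mini-batch) can be checked on a toy problem. The following is only a minimal sketch in plain Python, not the SystemML/DML code from the PR: the 1-D linear model, the squared loss, and all function names are hypothetical and chosen for illustration. It also assumes that the summed gradient is divided by the total number of examples before the update, which the description does not specify.

```python
# Toy check of synchronous SGD vs. one large mini-batch.
# Model: y ~ w * x with loss 0.5 * (w*x - y)**2 per example.
# All names are illustrative; none come from SystemML.

def summed_gradient(w, batch):
    """Sum of per-example gradients of the loss with respect to w."""
    return sum((w * x - y) * x for x, y in batch)

def sync_sgd_step(w, mini_batches, lr):
    """One synchronous step: one mini-batch per worker, gradients added."""
    # In a real system each call below would run on its own device in parallel;
    # here the workers are simulated sequentially.
    worker_grads = [summed_gradient(w, mb) for mb in mini_batches]
    total_grad = sum(worker_grads)  # aggregation by addition
    n_examples = sum(len(mb) for mb in mini_batches)
    # Assumption: normalize by the effective (large) mini-batch size.
    return w - lr * total_grad / n_examples

def large_batch_step(w, batch, lr):
    """Reference step using a single mini-batch containing all examples."""
    return w - lr * summed_gradient(w, batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
mini_batches = [data[:2], data[2:]]  # 2 workers, mini_batch_size = 2

w_sync = sync_sgd_step(0.0, mini_batches, lr=0.1)
w_large = large_batch_step(0.0, data, lr=0.1)
print(w_sync, w_large)
```

Here two workers with a mini-batch size of 2 behave like a single worker with a mini-batch size of 4, matching {{new_mini_batch_size = mini_batch_size * number_of_parallel_mini_batches}}. Backup workers, as described in \[1], are not modeled in this sketch.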