[
https://issues.apache.org/jira/browse/SYSTEMML-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Glenn Weidner updated SYSTEMML-1563:
------------------------------------
Fix Version/s: (was: SystemML 1.0)
SystemML 0.15
> Add a distributed synchronous SGD MNIST LeNet example
> -----------------------------------------------------
>
> Key: SYSTEMML-1563
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1563
> Project: SystemML
> Issue Type: New Feature
> Reporter: Mike Dusenberry
> Assignee: Mike Dusenberry
> Fix For: SystemML 0.15
>
>
> This aims to add a *distributed synchronous SGD* MNIST LeNet example. In
> distributed synchronous SGD, multiple mini-batches are run forward & backward
> simultaneously, and the gradients are aggregated together by addition before
> the model parameters are updated. This is mathematically equivalent to
> simply using a large mini-batch size, i.e. {{new_mini_batch_size =
> mini_batch_size * number_of_parallel_mini_batches}}. The benefit is that
> distributed synchronous SGD can make use of multiple devices, i.e. multiple
> GPUs or multiple CPU machines, and thus can speed up training time. More
> specifically, using an effectively larger mini-batch size can yield a more
> stable gradient in expectation, and a larger number of epochs can be run in
> the same amount of time, both of which lead to faster convergence.
> Alternatives include various forms of distributed _asynchronous_ SGD, such as
> Downpour, Hogwild, etc. However, a recent paper \[1] from Google Brain /
> OpenAI has found evidence supporting the claim that distributed synchronous
> SGD can lead to faster convergence, particularly if it is extended with the
> notion of "backup workers" as described in the paper.
> We will first aim for distributed synchronous SGD with no backup workers, and
> then extend this to include backup workers. The MNIST LeNet model will
> simply serve as an example, and this same approach can be extended to more
> recent models, such as ResNets.
> \[1]: https://arxiv.org/abs/1604.00981
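The gradient-aggregation equivalence described in the issue can be sketched as follows. This is a hypothetical pure-Python illustration (not SystemML DML, and not the actual LeNet example): `grad`, `sync_sgd_step`, and the tiny linear model are all made up for exposition. Each "worker" computes the gradient of a squared-error loss on its own shard of the mini-batch; the shard gradients are aggregated by addition before a single parameter update, which matches one SGD step on the full (larger) mini-batch exactly.

```python
# Hypothetical sketch of distributed synchronous SGD on a 1-D linear
# model y = w * x with squared-error loss. All names are illustrative.

def grad(w, shard):
    # Gradient of 0.5 * (w*x - y)^2, summed over the shard's examples.
    return sum((w * x - y) * x for x, y in shard)

def sync_sgd_step(w, shards, lr):
    # Each shard's forward/backward pass could run on a separate device;
    # the gradients are then aggregated together by addition before the
    # model parameter is updated (synchronous SGD).
    total_grad = sum(grad(w, shard) for shard in shards)
    return w - lr * total_grad

def large_batch_step(w, batch, lr):
    # One SGD step over the full batch, i.e. the effectively larger
    # mini-batch of size mini_batch_size * number_of_parallel_mini_batches.
    return w - lr * grad(w, batch)

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [batch[:2], batch[2:]]  # two "parallel workers", two examples each

w0, lr = 0.0, 0.01
w_sync = sync_sgd_step(w0, shards, lr)
w_big = large_batch_step(w0, batch, lr)
assert abs(w_sync - w_big) < 1e-12  # mathematically equivalent updates
```

Note that the asynchronous variants mentioned above (Downpour, Hogwild) drop exactly this aggregation barrier: workers update the parameters with possibly stale gradients, trading the strict equivalence shown here for reduced synchronization cost.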
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)