[ 
https://issues.apache.org/jira/browse/SYSTEMML-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15985693#comment-15985693
 ] 

Mike Dusenberry commented on SYSTEMML-1563:
-------------------------------------------

cc [~nakul02], [~niketanpansare], [~prithvi_r_s], [~reinwald]

> Add a distributed synchronous SGD MNIST LeNet example
> -----------------------------------------------------
>
>                 Key: SYSTEMML-1563
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1563
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Mike Dusenberry
>            Assignee: Mike Dusenberry
>
> This aims to add a distributed synchronous SGD MNIST LeNet example.  In 
> distributed synchronous SGD, multiple mini-batches are run forward & backward 
> simultaneously, and the gradients are aggregated together by addition before 
> the model parameters are updated.  This is mathematically equivalent to 
> simply using a large mini-batch size, i.e. {{new_mini_batch_size = 
> mini_batch_size * number_of_parallel_mini_batches}}.  The benefit is that 
> distributed synchronous SGD can make use of multiple devices, e.g. multiple 
> GPUs or multiple CPU machines, and thus can reduce training time.  More 
> specifically, using an effectively larger mini-batch size can yield a more 
> stable gradient in expectation, and a larger number of epochs can be run in 
> the same amount of time, both of which lead to faster convergence.  
> Alternatives include various forms of distributed *asynchronous* SGD, such as 
> Downpour, Hogwild, etc.  However, a recent paper \[1] from Google Brain / 
> OpenAI has found evidence supporting the claim that distributed synchronous 
> SGD can lead to faster convergence, particularly if it is extended with the 
> notion of "backup workers" as described in the paper.
> We will first aim for distributed synchronous SGD with no backup workers, and 
> then extend this to include backup workers.  The MNIST LeNet model will 
> simply serve as an example, and this same approach can be extended to more 
> recent models, such as ResNets.
> \[1]: https://arxiv.org/abs/1604.00981
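
To make the equivalence claim in the description concrete, here is a minimal sketch in plain Python. It is not SystemML code and not the planned LeNet example: the linear model, the squared-error loss, and the names grad_sum, sync_sgd_step, and large_batch_step are all assumptions chosen for illustration. The workers are simulated by a sequential loop. Each "worker" returns the sum of its per-example gradients, the sums are added together, and the update is normalized by the total number of examples, which should match a single step on the combined mini-batch up to floating-point rounding.

```python
import math
import random


def grad_sum(w, batch):
    """Sum of per-example gradients of 0.5 * (w.x - y)^2 over one mini-batch."""
    g = [0.0] * len(w)
    for x, y in batch:
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += residual * xj
    return g


def sync_sgd_step(w, mini_batches, lr):
    """One synchronous step: add up the gradients of all mini-batches, then update once."""
    total = [0.0] * len(w)
    n = 0
    for batch in mini_batches:  # hypothetical workers; a real system would run these in parallel
        g = grad_sum(w, batch)
        total = [t + gj for t, gj in zip(total, g)]
        n += len(batch)
    return [wi - lr * t / n for wi, t in zip(w, total)]


def large_batch_step(w, batch, lr):
    """One ordinary SGD step on a single large mini-batch."""
    g = grad_sum(w, batch)
    return [wi - lr * gj / len(batch) for wi, gj in zip(w, g)]


random.seed(0)
# Synthetic data: 32 examples with 3 features each (illustrative only).
data = [([random.gauss(0, 1) for _ in range(3)], random.gauss(0, 1)) for _ in range(32)]
w0 = [0.1, -0.2, 0.3]

# 4 parallel mini-batches of size 8, i.e. new_mini_batch_size = 8 * 4 = 32.
mini_batches = [data[i:i + 8] for i in range(0, 32, 8)]

w_sync = sync_sgd_step(w0, mini_batches, lr=0.05)
w_large = large_batch_step(w0, data, lr=0.05)

equivalent = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(w_sync, w_large)
)
print(equivalent)
```

Note that if each worker averaged its gradients instead of summing them, adding the averages would give a result scaled by the number of workers, so the learning rate or the normalization would need to be adjusted to keep the two updates identical.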



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
