[ 
https://issues.apache.org/jira/browse/SINGA-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632336#comment-14632336
 ] 

ASF subversion and git services commented on SINGA-32:
------------------------------------------------------

Commit 585e275fdf050db25eb9c583fb54ae39714d9b20 in incubator-singa's branch 
refs/heads/master from wang wei
[ https://git-wip-us.apache.org/repos/asf?p=incubator-singa.git;h=585e275 ]

SINGA-32 Implement Synchronous training frameworks

For the synchronous training frameworks, one worker group and one server group 
are launched.
Gradients for the same Param are aggregated locally at each process's stub.
The server does not conduct the update until it has received all gradients for 
the same Param (slice).
After the update, the server sends the new Param (slice) values back to every 
process that sent an update request.
The worker_shard_ and server_shard consist of ParamEntry objects, each of which 
stores the information for one unique Param (slice), e.g.,
the number of shares of that Param (slice) and its local shares.
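
The buffering rule described above can be sketched roughly as follows. This is
only an illustration and not the code from the commit: the field names, the
HandleUpdate method, and the averaged plain-SGD update are all assumptions made
for the example. Only the idea of a ParamEntry that knows how many workers share
a Param (slice), and a server that waits for all of them, comes from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical server-side entry for one Param (slice).
struct ParamEntry {
  int num_total;                // workers computing gradients for this slice
  int num_received;             // workers covered by requests seen so far
  std::vector<float> value;     // current Param (slice) values
  std::vector<float> grad_sum;  // gradients buffered in this iteration

  ParamEntry(int total, std::vector<float> init)
      : num_total(total),
        num_received(0),
        value(std::move(init)),
        grad_sum(value.size(), 0.0f) {}

  // Buffers one update request from a stub. `grad` is the stub's locally
  // aggregated gradient and `num_local` the number of local workers behind it.
  // Returns true only when the request completes the set and the update has
  // been applied; the caller would then send `value` back to the stubs.
  bool HandleUpdate(const std::vector<float>& grad, int num_local, float lr) {
    assert(grad.size() == value.size());
    for (std::size_t i = 0; i < grad.size(); ++i) grad_sum[i] += grad[i];
    num_received += num_local;
    if (num_received < num_total) return false;  // still waiting for workers

    // Assumed update rule: plain SGD on the gradient averaged over workers.
    for (std::size_t i = 0; i < value.size(); ++i) {
      value[i] -= lr * grad_sum[i] / num_total;
      grad_sum[i] = 0.0f;
    }
    num_received = 0;  // ready for the next iteration
    return true;
  }
};

}  // namespace sketch
```

With three workers spread over two processes, the first request (two local
workers) is only buffered, and the second request (one local worker) triggers
the update.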

The Msg class is improved to have a clean, simple API. The msg header now 
includes a src (int), a dst (int) and a trgt (int value and int version),
representing the source addr, destination addr and target of the msg. The 
address is constructed by the
entity that creates the msg. Any addr is valid as long as it is unique to one 
entity.
The function Addr(int grp, int id_or_proc, int type) is provided to construct 
the addr from the
group ID, worker/server ID (or procs ID) and entity type (kServer, kStub, 
etc.). Functions are also provided to extract
the group and worker/server ID from the addr (int). Similarly, the target field 
can be constructed using the ParamTrgt function,
which wraps the Param ID and Slice ID into a target value (int). ParamID() and 
SliceID() extract that info from the target value.
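
As a rough illustration of how such int-packed addresses and targets could
work, here is a minimal sketch. The bit layout (8 bits for the type, 8 bits for
the ID, the rest for the group; 10 bits for the slice ID) is an assumption made
for this example and may differ from the committed code. The extractor names
AddrGrp, AddrID and AddrType, the kWorker entry, and the enum values are
likewise hypothetical; only Addr, ParamTrgt, ParamID, SliceID, kServer and kStub
are named in the message above.

```cpp
#include <cassert>

namespace sketch {

// kServer and kStub appear in the commit message; kWorker and the numeric
// values are assumptions for this example.
enum EntityType { kWorker = 0, kServer = 1, kStub = 2 };

// Assumed layout: bits 16 and up = group, bits 8-15 = ID (or procs ID),
// bits 0-7 = entity type.
inline int Addr(int grp, int id_or_proc, int type) {
  assert(grp >= 0 && grp < (1 << 15));
  assert(id_or_proc >= 0 && id_or_proc < (1 << 8));
  assert(type >= 0 && type < (1 << 8));
  return (grp << 16) | (id_or_proc << 8) | type;
}

// Hypothetical extractor names; the message only says such functions exist.
inline int AddrGrp(int addr) { return addr >> 16; }
inline int AddrID(int addr) { return (addr >> 8) & 0xFF; }
inline int AddrType(int addr) { return addr & 0xFF; }

// Assumed layout: low 10 bits = Slice ID, remaining bits = Param ID.
inline int ParamTrgt(int param_id, int slice_id) {
  assert(param_id >= 0 && param_id < (1 << 21));
  assert(slice_id >= 0 && slice_id < (1 << 10));
  return (param_id << 10) | slice_id;
}

inline int ParamID(int trgt) { return trgt >> 10; }
inline int SliceID(int trgt) { return trgt & 0x3FF; }

}  // namespace sketch
```

Under these assumptions, Addr(3, 7, kStub) round-trips to group 3, ID 7 and
type kStub, and ParamTrgt(42, 5) round-trips to Param ID 42 and Slice ID 5.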


> Implement AllReduce training framework
> --------------------------------------
>
>                 Key: SINGA-32
>                 URL: https://issues.apache.org/jira/browse/SINGA-32
>             Project: Singa
>          Issue Type: New Feature
>            Reporter: wangwei
>            Assignee: wangwei
>
> The AllReduce training framework runs in synchronous mode, where a worker 
> starts the next iteration only after all workers have finished the previous 
> iteration. Baidu's deepimage system uses this training framework.
> To implement it in SINGA, we launch one worker group and one server group. 
> The model is partitioned (e.g., on dimension 0) among all workers. Params are 
> sliced and partitioned among all servers. 
> At the beginning, each Param (slice) is put into the server shard together 
> with the number of workers computing gradients for it.
> For each iteration, the local stub aggregates all gradients for the same 
> Param and sends them to the corresponding server, along with the number of 
> local workers computing gradients for it. The server buffers update requests 
> and conducts the update for a Param slice only after it has received 
> gradients from all workers. It then sends the updated Param (slices) back to 
> the corresponding process (stub).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
