[ 
https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2087:
--------------------------------
    Description: This part aims to implement the parameter server for the Spark 
distributed backend. In general, the implementation of the ps is very close to 
the local ps. The ps runs in the driver node and provides the pull/push service 
to the workers, while the communication between the ps and the workers is done 
via RPC. The data then needs to be distributed to the workers according to the 
different data partition schemes. The worker setup and cleanup differ from the 
local case and need to be handled.   (was: This part aims to implement the 
parameter server for the Spark distributed backend. In general, we could launch 
a parameter server on a host to provide the pull and push service. For the 
moment, all the weights and biases are saved in a hashmap under a key, e.g., 
"global parameter". Each worker's gradients will be put into the hashmap 
separately under a given key. The exchange between server and workers will 
be implemented with Netty RPC. Hence, we could easily broadcast the IP address 
and the port number to the workers, and the workers can then send the gradients 
and retrieve the new parameters via Netty RPC. The server will also spawn a 
thread which retrieves the gradients by polling the hashmap using the relevant 
keys and aggregates them. Finally, it updates the global parameter in the hashmap.)

> Initial version of distributed spark backend
> --------------------------------------------
>
>                 Key: SYSTEMML-2087
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2087
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>
> This part aims to implement the parameter server for the Spark distributed 
> backend. In general, the implementation of the ps is very close to the local 
> ps. The ps runs in the driver node and provides the pull/push service to the 
> workers, while the communication between the ps and the workers is done via 
> RPC. The data then needs to be distributed to the workers according to the 
> different data partition schemes. The worker setup and cleanup differ from 
> the local case and need to be handled. 
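
To illustrate the pull/push service described above, here is a minimal sketch
in Java. It is not SystemML's implementation: the class name
ParamServerSketch, the plain SGD update, and the learning rate are all
assumptions made for illustration, and direct method calls from threads stand
in for the RPC channel between the ps and the workers.

```java
import java.util.Arrays;

// Hypothetical sketch; names and update rule are illustrative, not SystemML's API.
public class ParamServerSketch {
    private final double[] params;      // global model parameters held by the server
    private final double learningRate;  // assumed fixed step size

    public ParamServerSketch(double[] initial, double learningRate) {
        this.params = initial.clone();
        this.learningRate = learningRate;
    }

    // push: a worker sends its gradient; the server applies one SGD step.
    public synchronized void push(double[] gradient) {
        if (gradient.length != params.length)
            throw new IllegalArgumentException("gradient length mismatch");
        for (int i = 0; i < params.length; i++)
            params[i] -= learningRate * gradient[i];
    }

    // pull: a worker fetches a snapshot of the current parameters.
    public synchronized double[] pull() {
        return params.clone();
    }

    public static void main(String[] args) throws InterruptedException {
        ParamServerSketch ps = new ParamServerSketch(new double[]{1.0, 1.0}, 0.5);
        // Two worker threads stand in for remote workers calling the ps over RPC.
        Thread[] workers = new Thread[2];
        for (int w = 0; w < workers.length; w++) {
            workers[w] = new Thread(() -> ps.push(new double[]{1.0, 2.0}));
            workers[w].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(Arrays.toString(ps.pull()));
    }
}
```

Synchronizing push and pull keeps concurrent worker updates from interleaving
inside a single step. In a distributed setting the same two operations would
be exposed through the RPC layer instead of being called directly.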



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
