[ https://issues.apache.org/jira/browse/SYSTEMML-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
LI Guobao updated SYSTEMML-2087:
--------------------------------
    Description: This part aims to implement the parameter server for the Spark distributed backend. In general, the implementation of the parameter server (ps) is very close to the local ps. The ps provides the pull/push service to the workers from the driver node, and the communication between the ps and the workers is done via RPC. The data then needs to be distributed to the workers according to the different data partitioning schemes. Worker setup and cleanup differ from the local case and need to be handled accordingly.

  (was: This part aims to implement the parameter server for the Spark distributed backend. In general, we could launch a parameter server on a host to provide the pull and push service. For the moment, all the weights and biases are saved in a hashmap under a key, e.g., "global parameter". Each worker's gradients are put into the hashmap separately under a given key. The exchange between the server and the workers will be implemented with Netty RPC, so we can simply broadcast the server's IP address and port number to the workers. The workers can then send their gradients and retrieve the new parameters via Netty RPC. The server also spawns a thread that retrieves the gradients by polling the hashmap with the relevant keys and aggregates them. Finally, it updates the global parameter in the hashmap.)

> Initial version of distributed spark backend
> --------------------------------------------
>
>                 Key: SYSTEMML-2087
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2087
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
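
For illustration only, the push/pull exchange described above can be sketched in a few lines of Python. This is not the SystemML implementation: the class and function names, the per-worker key format, and the learning rate are hypothetical, RPC is replaced by direct method calls inside one process, and aggregation is assumed to be synchronous (the update is applied once every worker has pushed) rather than done by a polling thread. Only the "global parameter" key comes from the earlier description.

```python
import threading


class ParamServer:
    """Toy in-process stand-in for the parameter server (hypothetical API)."""

    GLOBAL_KEY = "global parameter"  # key name taken from the earlier description

    def __init__(self, init_params, num_workers, lr=0.1):
        # A single hashmap holds the global parameters and the pushed gradients.
        self._store = {self.GLOBAL_KEY: list(init_params)}
        self._num_workers = num_workers
        self._lr = lr  # assumed plain gradient-descent step size
        self._lock = threading.Lock()

    def pull(self):
        """Return a copy of the current global parameters."""
        with self._lock:
            return list(self._store[self.GLOBAL_KEY])

    def push(self, worker_id, grads):
        """Store one worker's gradients; aggregate once all workers have pushed."""
        with self._lock:
            self._store[f"gradients_{worker_id}"] = list(grads)
            keys = [f"gradients_{w}" for w in range(self._num_workers)]
            if all(k in self._store for k in keys):
                self._aggregate(keys)

    def _aggregate(self, keys):
        # Average the gradients element-wise and update the global parameters.
        params = self._store[self.GLOBAL_KEY]
        for i in range(len(params)):
            mean_grad = sum(self._store[k][i] for k in keys) / len(keys)
            params[i] -= self._lr * mean_grad
        # Clear the per-worker entries so the next round starts empty.
        for k in keys:
            del self._store[k]


def worker(ps, worker_id, data):
    """Toy worker: gradient of 0.5 * (w - x)^2 averaged over its data partition."""
    params = ps.pull()
    grads = [sum(w - x for x in data) / len(data) for w in params]
    ps.push(worker_id, grads)


def run_demo():
    """Run one synchronous round with two workers, each holding one partition."""
    ps = ParamServer(init_params=[0.0], num_workers=2, lr=0.5)
    partitions = [[1.0, 3.0], [5.0, 7.0]]
    threads = [
        threading.Thread(target=worker, args=(ps, i, part))
        for i, part in enumerate(partitions)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ps.pull()
```

In a distributed setting the `pull` and `push` calls would travel over RPC and each partition would live on a different executor; the sketch only shows the bookkeeping on the server side.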