[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
LI Guobao updated SYSTEMML-2085:
--------------------------------

    Description:
A single-node parameter server acts as a data-parallel parameter server; a multi-node, model-parallel parameter server will be discussed if time permits.
# For the local multi-threaded parameter server, it is easy to maintain a concurrent hashmap inside the CP, where the parameters are values accompanied by defined keys. The workers are launched as separate threads that execute the gradient-computation function and push their gradients into the hashmap. Another thread pulls the gradients from the hashmap and calls the aggregation function to update the parameters (see the first sketch after this list).
# For the Spark distributed backend, we could launch a single remote parameter server outside the CP (as a worker) to provide the push and pull service. For the moment, all weights and biases are stored on this single server. The exchange between server and workers is implemented over TCP: we broadcast the server's IP address and port number to the workers, and the workers then send gradients and retrieve the new parameters via TCP sockets (see the second sketch below).
We may also need to implement synchronization between the workers and the parameter server in order to support more parameter-update strategies; e.g., the stale-synchronous strategy needs a "staleness" hyperparameter that bounds how far workers may run ahead of each other. The idea is to maintain in the server a vector clock consisting of all workers' clocks. Each time a worker finishes an iteration, it sends a request to the server, and the server responds indicating whether the worker should wait (see the third sketch below). A diagram of the parameter server architecture is shown below.
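A minimal sketch of the local backend from item 1, assuming parameters simplified to double arrays keyed by name; the class and method names (LocalParamServer, push, pull, aggregateOnce) and the toy objective are illustrative only, not existing SystemML APIs:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Illustrative sketch: a concurrent hashmap inside the CP, worker threads
// that push gradients, and one aggregator thread that applies them.
public class LocalParamServer {

  private final ConcurrentHashMap<String, double[]> gradients = new ConcurrentHashMap<>();
  private final ConcurrentHashMap<String, double[]> params = new ConcurrentHashMap<>();
  private final double stepSize = 0.05;

  // Workers push gradients; pending gradients for the same key are accumulated.
  void push(String key, double[] grad) {
    gradients.merge(key, grad, (a, b) -> {
      double[] sum = a.clone();
      for (int i = 0; i < sum.length; i++) sum[i] += b[i];
      return sum;
    });
  }

  // Workers pull the current parameters.
  double[] pull(String key) {
    return params.get(key).clone();
  }

  // Aggregator: pull pending gradients from the hashmap and update the parameters.
  void aggregateOnce() {
    for (String key : gradients.keySet()) {
      double[] g = gradients.remove(key);
      if (g == null) continue;
      params.computeIfPresent(key, (k, w) -> {
        double[] nw = w.clone();
        for (int i = 0; i < nw.length; i++) nw[i] -= stepSize * g[i];
        return nw;
      });
    }
  }

  public static void main(String[] args) throws InterruptedException {
    LocalParamServer ps = new LocalParamServer();
    ps.params.put("w", new double[]{0.0});
    CountDownLatch done = new CountDownLatch(4);
    for (int i = 0; i < 4; i++) {           // 4 worker threads
      new Thread(() -> {
        for (int t = 0; t < 25; t++) {      // toy objective f(w) = (w - 3)^2
          double w = ps.pull("w")[0];
          ps.push("w", new double[]{2 * (w - 3)});
        }
        done.countDown();
      }).start();
    }
    Thread aggregator = new Thread(() -> {  // the separate update thread
      while (done.getCount() > 0 || !ps.gradients.isEmpty()) ps.aggregateOnce();
    });
    aggregator.start();
    done.await();
    aggregator.join();
    System.out.println("w = " + ps.pull("w")[0]); // converges toward 3
  }
}
{code}

Note that the merge call accumulates gradients the aggregator has not yet consumed; whether pending gradients should be accumulated or overwritten is a design choice left open here.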
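A second sketch for item 2, showing the TCP push/pull exchange. The wire protocol here (a UTF command string followed by raw doubles, one request per connection) and the fixed two-element parameter vector are assumptions for illustration; nothing below is existing SystemML code:

{code:java}
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative wire protocol: a worker connects and sends "PUSH" followed by
// a gradient, or "PULL" to retrieve the current parameters.
public class TcpParamServer {

  private final double[] params = {0.0, 0.0}; // all weights/biases on the server
  private final double stepSize = 0.1;

  // Server loop: accept a connection and answer one push or pull request.
  void serve(int port) throws IOException {
    try (ServerSocket server = new ServerSocket(port)) {
      while (true) {
        try (Socket s = server.accept();
             DataInputStream in = new DataInputStream(s.getInputStream());
             DataOutputStream out = new DataOutputStream(s.getOutputStream())) {
          String cmd = in.readUTF();
          synchronized (params) {
            if ("PUSH".equals(cmd)) {         // apply the worker's gradient
              for (int i = 0; i < params.length; i++)
                params[i] -= stepSize * in.readDouble();
              out.writeUTF("OK");
            } else {                          // "PULL": send current parameters
              for (double p : params) out.writeDouble(p);
            }
          }
          out.flush();
        }
      }
    }
  }

  // Worker side: send a gradient, then retrieve the new parameters.
  static double[] pushAndPull(String host, int port, double[] grad) throws IOException {
    try (Socket s = new Socket(host, port);
         DataOutputStream out = new DataOutputStream(s.getOutputStream());
         DataInputStream in = new DataInputStream(s.getInputStream())) {
      out.writeUTF("PUSH");
      for (double g : grad) out.writeDouble(g);
      out.flush();
      in.readUTF();                           // wait for the ack
    }
    try (Socket s = new Socket(host, port);
         DataOutputStream out = new DataOutputStream(s.getOutputStream());
         DataInputStream in = new DataInputStream(s.getInputStream())) {
      out.writeUTF("PULL");
      out.flush();
      double[] w = new double[2];
      for (int i = 0; i < w.length; i++) w[i] = in.readDouble();
      return w;
    }
  }

  public static void main(String[] args) throws Exception {
    TcpParamServer ps = new TcpParamServer();
    Thread server = new Thread(() -> {
      try { ps.serve(9090); } catch (IOException ignored) { }
    });
    server.setDaemon(true);                   // let the JVM exit after main
    server.start();
    Thread.sleep(200);                        // crude wait for the server to bind
    double[] w = pushAndPull("localhost", 9090, new double[]{0.5, -0.5});
    System.out.println(w[0] + ", " + w[1]);   // -0.05, 0.05
  }
}
{code}

In the real setting, the host/port pair printed here is what would be broadcast to the Spark workers, and the one-request-per-connection framing would likely be replaced by persistent connections.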
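A third sketch for the stale-synchronous idea: the server-side vector-clock check, condensed to a single JVM where wait/notifyAll stands in for the request/response exchange over TCP. All names (StaleSyncServer, finishIteration) and the simulated worker speeds are hypothetical:

{code:java}
import java.util.Arrays;

// Illustrative check: a vector clock holds every worker's iteration count;
// a worker more than 'staleness' iterations ahead of the slowest must wait.
public class StaleSyncServer {

  private final int[] clocks;   // clocks[i] = iterations finished by worker i
  private final int staleness;  // allowed gap between fastest and slowest

  StaleSyncServer(int numWorkers, int staleness) {
    this.clocks = new int[numWorkers];
    this.staleness = staleness;
  }

  // Called when worker 'id' finishes an iteration; in the distributed setting
  // this would be the handler for the worker's end-of-iteration request.
  synchronized void finishIteration(int id) throws InterruptedException {
    clocks[id]++;
    notifyAll();  // the slowest worker may have advanced: re-check all waiters
    while (clocks[id] > Arrays.stream(clocks).min().getAsInt() + staleness)
      wait();     // "wait" response: this worker is too far ahead
  }

  public static void main(String[] args) {
    StaleSyncServer server = new StaleSyncServer(3, 1); // 3 workers, staleness 1
    for (int id = 0; id < 3; id++) {
      final int w = id;
      new Thread(() -> {
        try {
          for (int t = 1; t <= 5; t++) {
            Thread.sleep(20L * (w + 1));  // workers run at different speeds
            server.finishIteration(w);
            System.out.println("worker " + w + " finished iteration " + t);
          }
        } catch (InterruptedException ignored) { }
      }).start();
    }
  }
}
{code}

With staleness 0 this degenerates to bulk-synchronous behavior; the slowest worker never blocks, so the protocol cannot deadlock.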
> Single-node parameter server primitives
> ---------------------------------------
>
>                 Key: SYSTEMML-2085
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>         Attachments: ps.png
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)