[ 
https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2085:
--------------------------------
    Description: 
A single-node parameter server acts as a data-parallel parameter server, while a 
multi-node, model-parallel parameter server will be discussed if time permits. 
 # For the local multi-threaded parameter server, it is easy to maintain a 
concurrent hashmap inside the CP, where each parameter is stored as a value 
under a defined key. The workers are launched as separate threads that execute 
the gradient-computation function and push the resulting gradients into the 
hashmap. An additional thread pulls the gradients and calls the aggregation 
function to update the parameters (see the first sketch below the list). 
 # For the Spark distributed backend, we could launch a single remote parameter 
server outside of the CP (as a worker) to provide the push and pull service. 
For the moment, all weights and biases are kept in this single server, and the 
exchange between server and workers is implemented over TCP. Hence, we can 
simply broadcast the server's IP address and port number to the workers, which 
then send gradients and retrieve the updated parameters via a TCP socket (see 
the second sketch below the list). 
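
For illustration, a minimal sketch of the local multi-threaded variant from 
item 1, assuming hypothetical names (LocalPSDemo, a scalar parameter "w", a 
dummy gradient function); this is not SystemML code, just the push/aggregate 
pattern described above.

{code:java}
import java.util.concurrent.*;

// Hypothetical sketch only: a local data-parallel parameter server backed by
// a ConcurrentHashMap, with worker threads pushing gradients and a single
// aggregator thread pulling and applying them. A scalar stands in for a
// weight matrix; names and the update rule are illustrative, not SystemML API.
public class LocalPSDemo {
  static final ConcurrentHashMap<String, Double> params = new ConcurrentHashMap<>();
  static final BlockingQueue<Double> gradients = new LinkedBlockingQueue<>();

  public static void main(String[] args) throws InterruptedException {
    params.put("w", 0.0);
    final int numWorkers = 4, iterations = 10;
    final double lr = 0.1;

    // Aggregator thread: pull gradients and apply the aggregation rule.
    Thread aggregator = new Thread(() -> {
      try {
        for (int i = 0; i < numWorkers * iterations; i++) {
          double g = gradients.take();              // pull one gradient
          params.merge("w", -lr * g, Double::sum);  // w <- w - lr * g
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    aggregator.start();

    // Workers: read the current parameters, compute a gradient, push it.
    ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
    for (int w = 0; w < numWorkers; w++) {
      pool.submit(() -> {
        for (int i = 0; i < iterations; i++) {
          double cur = params.get("w");
          double grad = 2 * (cur - 3.0);  // dummy gradient of (w - 3)^2
          gradients.offer(grad);          // push into the shared structure
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
    aggregator.join();
    System.out.println("final w = " + params.get("w")); // approaches 3.0
  }
}
{code}

The gradients are staged in a queue here to keep the sketch simple; the single 
concurrent hashmap from the description could equally hold both parameters and 
pending gradients under different keys.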
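And a similarly hedged sketch of the remote push/pull service from item 2. The 
one-line text protocol ("PUSH <gradient>" / "PULL") is invented for 
illustration; a real implementation would exchange serialized matrices, and 
the broadcast of the address and port is simulated by a shared variable.

{code:java}
import java.io.*;
import java.net.*;

// Hypothetical sketch only: a single remote parameter server serving push and
// pull requests over TCP with a made-up one-line text protocol. A scalar
// stands in for the full set of weights and biases.
public class TcpPSDemo {
  public static void main(String[] args) throws Exception {
    ServerSocket server = new ServerSocket(0);  // port 0: any free port
    int port = server.getLocalPort();           // "broadcast" this to workers

    // Server thread: owns the single copy of the parameters.
    Thread ps = new Thread(() -> {
      double w = 0.0;
      final double lr = 0.1;
      try {
        while (true) {
          try (Socket s = server.accept();
               BufferedReader in = new BufferedReader(
                   new InputStreamReader(s.getInputStream()));
               PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            String line = in.readLine();
            if (line == null) continue;
            String[] req = line.split(" ");
            if (req[0].equals("PUSH")) {
              w -= lr * Double.parseDouble(req[1]);  // apply pushed gradient
              out.println("OK");
            } else {                                 // "PULL"
              out.println(w);                        // return current params
            }
          }
        }
      } catch (IOException e) { /* server socket closed: shut down */ }
    });
    ps.setDaemon(true);
    ps.start();

    // A worker: pull parameters, compute a dummy gradient, push it back.
    double w = Double.parseDouble(request(port, "PULL"));
    request(port, "PUSH " + (2 * (w - 3.0)));        // gradient of (w - 3)^2
    System.out.println("after push, w = " + request(port, "PULL"));
    server.close();
  }

  // One request/response round trip over a fresh TCP connection.
  static String request(int port, String msg) throws IOException {
    try (Socket s = new Socket("localhost", port);
         PrintWriter out = new PrintWriter(s.getOutputStream(), true);
         BufferedReader in = new BufferedReader(
             new InputStreamReader(s.getInputStream()))) {
      out.println(msg);
      return in.readLine();
    }
  }
}
{code}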

We may also need to implement synchronisation between the workers and the 
parameter server in order to support additional parameter-update strategies. 
For example, the stale-synchronous strategy needs a "staleness" hyperparameter 
to define the waiting interval. The idea is to maintain in the server a vector 
clock consisting of all workers' clocks. Each time a worker finishes an 
iteration, it sends a request to the server, and the server sends back a 
response indicating whether the worker should wait or not.
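
A small sketch of this server-side staleness check, assuming the usual 
stale-synchronous rule that a worker waits once it is more than "staleness" 
iterations ahead of the slowest worker; all names are hypothetical.

{code:java}
// Hypothetical sketch only: the server-side vector clock for the
// stale-synchronous strategy. A worker may proceed as long as it is at most
// 'staleness' iterations ahead of the slowest worker.
public class VectorClock {
  private final int[] clocks;   // one logical clock per worker
  private final int staleness;  // the "staleness" hyperparameter

  public VectorClock(int numWorkers, int staleness) {
    this.clocks = new int[numWorkers];
    this.staleness = staleness;
  }

  // Called when a worker finishes an iteration; the return value is the
  // server's response: true = proceed, false = wait for stragglers.
  public synchronized boolean finishIteration(int workerId) {
    clocks[workerId]++;
    int min = Integer.MAX_VALUE;
    for (int c : clocks)
      min = Math.min(min, c);
    return clocks[workerId] - min <= staleness;
  }

  public static void main(String[] args) {
    VectorClock vc = new VectorClock(2, 1);    // 2 workers, staleness = 1
    System.out.println(vc.finishIteration(0)); // true:  gap 1 <= 1
    System.out.println(vc.finishIteration(0)); // false: gap 2 >  1, must wait
    System.out.println(vc.finishIteration(1)); // true:  worker 1 catches up
  }
}
{code}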

A diagram of the parameter server architecture is attached (ps.png).



> Single-node parameter server primitives
> ---------------------------------------
>
>                 Key: SYSTEMML-2085
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>         Attachments: ps.png
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
