[ 
https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI Guobao updated SYSTEMML-2085:
--------------------------------
    Description: 
A single-node parameter server acts as a data-parallel parameter server. A
multi-node, model-parallel parameter server will be discussed if time permits.
 # For the local multi-threaded parameter server, it is easy to maintain a
concurrent hashmap inside the CP (Control Program), in which each parameter is
stored as a value under a defined key. The workers are launched as threads
that execute the gradient calculation function and push the gradients to the
hashmap. Another thread is launched to pull the gradients from the hashmap and
call the aggregation function to update the parameters.
 # For the Spark distributed backend, we could launch a single remote
parameter server outside of the CP (as a worker) to provide the pull and push
service. For the moment, all the weights and biases are stored in this single
server. The exchange between server and workers will be implemented over TCP,
so we can simply broadcast the server's IP address and port number to the
workers. The workers can then send the gradients and retrieve the new
parameters via a TCP socket.
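
The local multi-threaded case could be sketched roughly as follows. This is
only an illustration under assumptions: the class name, the key names
"params" and "grad", the constant toy gradient and the plain SGD update are
all hypothetical and are not SystemML code. Pending gradients are summed
under a single key so that no push is lost before the aggregation thread
picks it up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a local parameter server; not SystemML code.
class LocalParamServerSketch {
    static final String PARAM_KEY = "params"; // assumed key for the model
    static final String GRAD_KEY = "grad";    // assumed key for pending gradients

    final ConcurrentHashMap<String, double[]> store = new ConcurrentHashMap<>();
    final double learningRate;

    LocalParamServerSketch(double[] init, double learningRate) {
        store.put(PARAM_KEY, init.clone());
        this.learningRate = learningRate;
    }

    // Worker side: read a copy of the current parameters.
    double[] pull() {
        return store.get(PARAM_KEY).clone();
    }

    // Worker side: atomically add a gradient to the pending sum.
    void push(double[] grad) {
        store.merge(GRAD_KEY, grad.clone(), LocalParamServerSketch::add);
    }

    // Aggregator side: take the pending gradient sum and apply one SGD step.
    boolean aggregate() {
        double[] g = store.remove(GRAD_KEY);
        if (g == null) return false; // nothing pushed since the last call
        store.compute(PARAM_KEY, (key, w) -> {
            double[] updated = w.clone();
            for (int i = 0; i < updated.length; i++) updated[i] -= learningRate * g[i];
            return updated;
        });
        return true;
    }

    private static double[] add(double[] a, double[] b) {
        double[] sum = new double[a.length];
        for (int i = 0; i < a.length; i++) sum[i] = a[i] + b[i];
        return sum;
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Runs worker threads plus one aggregation thread; returns the final parameters.
    static double[] run(int numWorkers, int iterations) {
        LocalParamServerSketch ps = new LocalParamServerSketch(new double[] {0.0, 0.0}, 0.1);
        AtomicBoolean done = new AtomicBoolean(false);
        Thread aggregator = new Thread(() -> {
            while (!done.get()) {
                ps.aggregate();
                Thread.yield();
            }
            ps.aggregate(); // drain whatever was pushed last
        });
        aggregator.start();
        List<Thread> workers = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) {
            Thread t = new Thread(() -> {
                for (int it = 0; it < iterations; it++) {
                    double[] current = ps.pull(); // input to a real gradient function
                    ps.push(new double[] {1.0, 2.0}); // toy constant gradient
                }
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) join(t);
        done.set(true);
        join(aggregator);
        return ps.pull();
    }
}
```

Calling LocalParamServerSketch.run(4, 5) applies 20 toy gradients in total. A
real implementation would replace the constant gradient with the gradient
calculation function and the SGD step with the aggregation function.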

We may also need to implement synchronisation between the workers and the
parameter server in order to support more parameter update strategies; e.g.,
the stale-synchronous strategy needs a hyperparameter "staleness" to define the
waiting interval. The idea is to maintain a vector clock in the server that
holds the clock of every worker. Each time a worker finishes an iteration, it
sends a request to the server, and the server sends back a response indicating
whether the worker should wait.
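
The wait decision described above might look like the following sketch. It
assumes that "staleness" means the maximum number of iterations a worker may
be ahead of the slowest worker; the class and method names are hypothetical
and are not SystemML code.

```java
import java.util.Arrays;

// Hypothetical sketch of the server-side vector clock; not SystemML code.
class StaleSyncClockSketch {
    private final int[] clocks;  // one logical clock per worker
    private final int staleness; // assumed: max allowed lead over the slowest worker

    StaleSyncClockSketch(int numWorkers, int staleness) {
        if (numWorkers <= 0) throw new IllegalArgumentException("need at least one worker");
        this.clocks = new int[numWorkers];
        this.staleness = staleness;
    }

    // Handles the request a worker sends after finishing an iteration.
    // Returns true if the response should tell the worker to wait.
    synchronized boolean finishIteration(int workerId) {
        clocks[workerId]++;
        return mustWait(workerId);
    }

    // A worker must wait while it is more than `staleness` iterations
    // ahead of the slowest worker.
    synchronized boolean mustWait(int workerId) {
        int slowest = Arrays.stream(clocks).min().getAsInt();
        return clocks[workerId] - slowest > staleness;
    }
}
```

Under this assumption, a staleness of 0 forces every worker to wait for the
slowest one after each iteration, while a very large staleness never blocks
and so behaves like a fully asynchronous strategy.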

  was: A single-node parameter server acts as a data-parallel parameter
server. A multi-node, model-parallel parameter server will be discussed if
time permits. The idea is to run a single-node parameter server by maintaining
a hashmap inside the CP (Control Program), in which each parameter is stored
as a value under a defined key. For example, inserting the global parameter
under a key named “worker-param-replica” allows the workers to retrieve the
parameter replica. Hence, with the local multi-threaded backend, workers can
communicate directly with this hashmap in the same process. With the Spark
distributed backend, the CP first needs to fork a thread to start a parameter
server that maintains a hashmap. The workers can then send intermediates and
retrieve parameters by connecting to the parameter server via a TCP socket.
Since SystemML has good cache management, the hashmap only needs to hold a
matrix reference pointing to a file location instead of the actual data
instance. If time permits, we would need to implement synchronization using a
vector clock in order to introduce the async and staleness update strategies.


> Single-node parameter server primitives
> ---------------------------------------
>
>                 Key: SYSTEMML-2085
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>         Attachments: ps.png
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
