[
https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
LI Guobao updated SYSTEMML-2085:
--------------------------------
Description:
A single-node parameter server acts as a data-parallel parameter server. A
multi-node, model-parallel parameter server will be discussed if time permits.
Push/Pull service:
In general, we could launch a parameter server inside the CP (local
multi-threaded backend) or outside of it (Spark distributed backend) to provide
the pull and push service. For the moment, all the weights and biases are kept
in a hashmap under a well-known key, e.g., "global parameter". Each worker's
gradients are put into the hashmap separately under a worker-specific key. The
exchange between the server and the workers will be implemented over TCP.
Hence, we can easily broadcast the server's IP address and port number to the
workers, which then send their gradients and retrieve the new parameters via
TCP sockets. The server also spawns a thread that retrieves the gradients by
polling the hashmap with the relevant keys, aggregates them, and finally
updates the global parameters in the hashmap, as sketched below.
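A minimal, illustrative sketch of such a push/pull service follows, assuming a
simple averaged-gradient SGD update; ParamServer, GLOBAL and GRAD_PREFIX are
hypothetical names, not the SystemML API:
{code:java}
import java.util.concurrent.ConcurrentHashMap;

// Sketch: a concurrent hashmap holds the global parameters under a
// well-known key and each worker's gradients under per-worker keys.
public class ParamServer {
  private static final String GLOBAL = "global parameter";
  private static final String GRAD_PREFIX = "gradient_";
  private final ConcurrentHashMap<String, double[]> store = new ConcurrentHashMap<>();

  public ParamServer(double[] initialParams) {
    store.put(GLOBAL, initialParams);
  }

  // Worker-side push: write the gradients under the worker's own key.
  public void push(int workerId, double[] gradients) {
    store.put(GRAD_PREFIX + workerId, gradients);
  }

  // Worker-side pull: read the current global parameters.
  public double[] pull() {
    return store.get(GLOBAL);
  }

  // Aggregator thread: poll the worker keys, average the available
  // gradients, and apply a plain SGD step to the global parameters.
  public void aggregate(int numWorkers, double learningRate) {
    double[] params = store.get(GLOBAL);
    for (int w = 0; w < numWorkers; w++) {
      double[] grad = store.remove(GRAD_PREFIX + w);
      if (grad == null)
        continue; // this worker has not pushed yet
      for (int i = 0; i < params.length; i++)
        params[i] -= learningRate * grad[i] / numWorkers;
    }
    store.put(GLOBAL, params);
  }
}
{code}
In the Spark backend, the same map would sit behind a TCP endpoint, with the
workers calling push/pull through sockets instead of direct method calls.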
Synchronization:
We also need to implement synchronization between the workers and the parameter
server in order to support more parameter-update strategies, e.g., the
stale-synchronous strategy needs a hyperparameter "staleness" to define the
waiting interval. The idea is to maintain in the server a vector clock
recording all workers' clocks. Each time a worker finishes an iteration, it
waits for a signal from the server, i.e., it sends a request asking the server
to compute the staleness from the vector clock. And when the server receives
gradients from a certain worker, it increments that worker's entry in the
vector clock. We can then define BSP as "staleness==0", ASP as "staleness==-1",
and SSP as "staleness==N", as illustrated by the sketch below.
A diagram of the parameter server architecture is attached (ps.png).
was:
A single-node parameter server acts as a data-parallel parameter server. A
multi-node, model-parallel parameter server will be discussed if time permits.
# For the case of a local multi-threaded parameter server, it is easy to
maintain a concurrent hashmap (with the parameters as values under defined
keys) inside the CP. The workers are launched in a multi-threaded way to
execute the gradient-computation function and push the gradients to the
hashmap. Another thread is launched to pull the gradients from the hashmap and
call the aggregation function to update the parameters.
# For the case of the Spark distributed backend, we could launch a single
remote parameter server outside of the CP (as a worker) to provide the pull
and push service. For the moment, all the weights and biases are kept on this
single server. The exchange between the server and the workers will be
implemented over TCP. Hence, we can easily broadcast the IP address and port
number to the workers, which then send their gradients and retrieve the new
parameters via TCP sockets.
We would also need to implement synchronization between the workers and the
parameter server in order to support more parameter-update strategies, e.g.,
the stale-synchronous strategy needs a hyperparameter "staleness" to define the
waiting interval. The idea is to maintain in the server a vector clock
consisting of all workers' clocks. Each time a worker finishes an iteration, it
sends a request to the server, and the server responds to indicate whether the
worker should wait or not.
A diagram of the parameter server architecture is shown below.
> Single-node parameter server primitives
> ---------------------------------------
>
> Key: SYSTEMML-2085
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
> Project: SystemML
> Issue Type: Sub-task
> Reporter: Matthias Boehm
> Assignee: LI Guobao
> Priority: Major
> Attachments: ps.png
>