[
https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
LI Guobao updated SYSTEMML-2085:
--------------------------------
Description:
A single-node parameter server acts as a data-parallel parameter server. A
multi-node, model-parallel parameter server will be discussed if time permits.
# For the local multi-threaded parameter server, it is easy to maintain a
concurrent hashmap inside the CP (Control Program), in which each parameter is
stored as a value under a defined key. The workers are launched as threads
that execute the gradient computation function and push the gradients to the
hashmap. Another thread is launched to pull the gradients from the hashmap and
call the aggregation function to update the parameters.
# For the Spark distributed backend, we could launch a single remote parameter
server outside of the CP (as a worker) to provide the pull and push service.
For the moment, all weights and biases are stored in this single server. The
exchange between server and workers will be implemented over TCP, so we can
simply broadcast the server's IP address and port number to the workers. The
workers can then send gradients and retrieve the new parameters via a TCP
socket.
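
The local multi-threaded variant from the first item could look roughly like
the sketch below. It is only an illustration in Python, not SystemML code: all
names (params, grad_queue, compute_gradient, the learning rate, the toy loss)
are assumptions made for the example, and a queue stands in for the gradient
slots of the concurrent hashmap.

```python
import queue
import threading

# All names here are hypothetical; this is not SystemML code.
params = {"W": 0.0}              # shared parameter map (key -> value)
params_lock = threading.Lock()   # guards reads and writes of params
grad_queue = queue.Queue()       # workers push gradients here
LEARNING_RATE = 0.1
NUM_WORKERS = 4
STEPS_PER_WORKER = 5


def compute_gradient(w, target=1.0):
    # Gradient of the toy loss 0.5 * (w - target)^2 with respect to w.
    return w - target


def worker():
    for _ in range(STEPS_PER_WORKER):
        with params_lock:
            w = params["W"]                          # pull current parameter
        grad_queue.put(("W", compute_gradient(w)))   # push the gradient


def aggregator(total_updates):
    # Dedicated thread: pull gradients and apply the aggregation function.
    for _ in range(total_updates):
        key, grad = grad_queue.get()
        with params_lock:
            params[key] -= LEARNING_RATE * grad


def run():
    total = NUM_WORKERS * STEPS_PER_WORKER
    agg = threading.Thread(target=aggregator, args=(total,))
    workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    agg.start()
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    agg.join()
    return params["W"]


if __name__ == "__main__":
    print(run())
```

Because workers may read a parameter before earlier gradients have been
applied, the final value depends on thread scheduling. That is the
asynchronous behaviour the synchronisation strategies are meant to control.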
We would also need to implement synchronisation between the workers and the
parameter server in order to support more parameter update strategies. For
example, the stale-synchronous strategy needs a hyperparameter "staleness" to
define the waiting interval. The idea is to maintain a vector clock on the
server that holds the clock of every worker. Each time an iteration finishes,
the worker sends a request to the server, and the server responds to indicate
whether the worker should wait.
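
The wait-or-proceed decision described above can be sketched as follows. This
is a minimal Python illustration under one common reading of the
stale-synchronous rule (a worker waits once it is more than "staleness"
iterations ahead of the slowest worker). The class and method names are
hypothetical and not part of SystemML.

```python
class StalenessClock:
    """Server-side vector clock: one iteration counter per worker.

    Hypothetical helper for illustration only.
    """

    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        self.staleness = staleness

    def must_wait(self, worker_id):
        # Wait if this worker is more than `staleness` iterations
        # ahead of the slowest worker.
        return self.clocks[worker_id] - min(self.clocks) > self.staleness

    def finish_iteration(self, worker_id):
        # Handle the request a worker sends after finishing an iteration:
        # advance its clock and tell it whether it has to wait.
        self.clocks[worker_id] += 1
        return self.must_wait(worker_id)
```

With staleness set to 0 this reduces to fully synchronous updates, and a very
large staleness effectively never blocks a worker.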
was: A single-node parameter server acts as a data-parallel parameter server.
A multi-node, model-parallel parameter server will be discussed if time
permits. The idea is to run a single-node parameter server by maintaining a
hashmap inside the CP (Control Program), in which each parameter is stored as
a value under a defined key. For example, inserting the global parameter under
a key named “worker-param-replica” allows the workers to retrieve the
parameter replica. Hence, with the local multi-threaded backend, workers can
communicate directly with this hashmap in the same process. With the Spark
distributed backend, the CP first needs to fork a thread to start a parameter
server that maintains a hashmap. The workers can then send intermediates and
retrieve parameters by connecting to the parameter server via a TCP socket.
Since SystemML has good cache management, the hashmap only needs to hold a
matrix reference pointing to a file location instead of the actual data. If
time permits, to introduce the asynchronous and staleness-based update
strategies, we would need to implement synchronization using a vector clock.
> Single-node parameter server primitives
> ---------------------------------------
>
> Key: SYSTEMML-2085
> URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
> Project: SystemML
> Issue Type: Sub-task
> Reporter: Matthias Boehm
> Assignee: LI Guobao
> Priority: Major
> Attachments: ps.png
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)