[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
LI Guobao updated SYSTEMML-2085:
--------------------------------
    Issue Type: Technical task  (was: Sub-task)

> Single-node parameter server primitives
> ---------------------------------------
>
>                 Key: SYSTEMML-2085
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
>             Project: SystemML
>          Issue Type: Technical task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>         Attachments: ps.png
>
>
> A single-node parameter server acts as a data-parallel parameter server. A
> multi-node, model-parallel parameter server will be discussed if time
> permits.
> Push/Pull service:
> In general, we could launch a parameter server inside (local multi-threaded
> backend) or outside (Spark distributed backend) of CP to provide the pull and
> push service. For the moment, all the weights and biases are stored in a
> hashmap under a key, e.g., "global parameter". Each worker's gradients will
> be put into the hashmap separately under a given key. The exchange between
> server and workers will be implemented over TCP, so we could simply
> broadcast the IP address and the port number to the workers. The workers
> can then send the gradients and retrieve the new parameters via a TCP
> socket. The server will also spawn a thread which retrieves the gradients by
> polling the hashmap using the relevant keys and aggregates them. Finally, it
> updates the global parameter in the hashmap.
> Synchronization:
> We also need to implement the synchronization between workers and parameter
> server in order to support more parameter update strategies, e.g., the
> stale-synchronous strategy needs a hyperparameter "staleness" to define the
> waiting interval. The idea is to maintain a vector clock in the server that
> records the clock of every worker. Each time an iteration inside a worker
> finishes, the worker waits for the server to give a signal, i.e., it sends a
> request for calculating the staleness according to the vector clock.
> When the server receives the gradients from a certain worker, it increments
> the vector clock entry for this worker. So we could define BSP as
> "staleness==0", ASP as "staleness==-1" and SSP as "staleness==N".
> A diagram of the parameter server architecture is shown below.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
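
The push/pull and synchronization ideas described in the issue can be sketched in a few lines. This is only an illustration under stated assumptions, not the SystemML implementation: the class name ParamServerSketch, its method names, the learning rate and the SGD-style update are all hypothetical. The TCP transport is replaced by direct method calls, and aggregation is triggered explicitly instead of by a polling thread. The rule used to turn the vector clock into a wait decision (a worker may continue if it is at most "staleness" iterations ahead of the slowest worker) is one common reading of SSP and is an assumption here, since the issue does not spell it out.

```python
import threading

GLOBAL_KEY = "global parameter"  # key named in the issue text


class ParamServerSketch:
    """Hypothetical in-process stand-in for the server (no TCP)."""

    def __init__(self, init_params, num_workers, staleness, lr=0.1):
        self.store = {GLOBAL_KEY: list(init_params)}  # the shared hashmap
        self.clock = [0] * num_workers  # vector clock, one entry per worker
        self.staleness = staleness      # 0 = BSP, -1 = ASP, N > 0 = SSP
        self.lr = lr                    # assumed learning rate
        self.lock = threading.Lock()

    def push(self, worker_id, grads):
        # Store the worker's gradients under its own key and tick its clock.
        with self.lock:
            self.store[("grad", worker_id)] = list(grads)
            self.clock[worker_id] += 1

    def aggregate(self):
        # Apply every pending gradient to the global parameter
        # (assumed plain SGD step), removing it from the hashmap.
        with self.lock:
            pending = [k for k in self.store if k != GLOBAL_KEY]
            for key in pending:
                grads = self.store.pop(key)
                params = self.store[GLOBAL_KEY]
                self.store[GLOBAL_KEY] = [
                    p - self.lr * g for p, g in zip(params, grads)
                ]

    def pull(self):
        # Return a copy of the current global parameter.
        with self.lock:
            return list(self.store[GLOBAL_KEY])

    def may_proceed(self, worker_id):
        # Assumed rule: ASP never waits; otherwise the worker may run
        # only while its lead over the slowest worker is <= staleness.
        with self.lock:
            if self.staleness == -1:
                return True
            lead = self.clock[worker_id] - min(self.clock)
            return lead <= self.staleness
```

With staleness 0 this behaves like BSP: a worker that has pushed once must wait until every other worker has pushed as well. With staleness -1 no worker ever waits, and with staleness N a fast worker may run up to N iterations ahead.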