[ https://issues.apache.org/jira/browse/SYSTEMML-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
LI Guobao updated SYSTEMML-2085:
--------------------------------
    Description:
A single-node parameter server acts as a data-parallel parameter server. A multi-node, model-parallel parameter server will be discussed if time permits.

The idea is to run a single-node parameter server by maintaining a hashmap inside the CP (Control Program), in which each parameter is stored as a value under a defined key. For example, inserting the global parameter under a key named “worker-param-replica” allows the workers to retrieve the parameter replica. In the local multi-threaded backend, workers can therefore communicate directly with this hashmap within the same process. In the Spark distributed backend, the CP first needs to fork a thread to start a parameter server that maintains the hashmap; the workers can then send intermediates and retrieve parameters by connecting to the parameter server via a TCP socket. Because SystemML already provides cache management, the hashmap only needs to hold matrix references that point to file locations, rather than the actual data instances. If time permits, introducing the asynchronous and staleness-based update strategies would require implementing synchronization with vector clocks.

  (was: A single node parameter server acts as a data-parallel parameter server. And a multi-node model parallel parameter server will be discussed if time permits. )

> Single-node parameter server primitives
> ---------------------------------------
>
>                 Key: SYSTEMML-2085
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2085
>             Project: SystemML
>          Issue Type: Sub-task
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>
> A single-node parameter server acts as a data-parallel parameter server. A
> multi-node, model-parallel parameter server will be discussed if time
> permits.
> The idea is to run a single-node parameter server by maintaining a hashmap
> inside the CP (Control Program), in which each parameter is stored as a
> value under a defined key. For example, inserting the global parameter under
> a key named “worker-param-replica” allows the workers to retrieve the
> parameter replica. In the local multi-threaded backend, workers can
> therefore communicate directly with this hashmap within the same process. In
> the Spark distributed backend, the CP first needs to fork a thread to start
> a parameter server that maintains the hashmap; the workers can then send
> intermediates and retrieve parameters by connecting to the parameter server
> via a TCP socket. Because SystemML already provides cache management, the
> hashmap only needs to hold matrix references that point to file locations,
> rather than the actual data instances. If time permits, introducing the
> asynchronous and staleness-based update strategies would require
> implementing synchronization with vector clocks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
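
As a rough illustration of the hashmap idea in the description above (local multi-threaded backend only), here is a minimal Java sketch. It is not SystemML code: the names LocalParamServer, MatrixRef, push and pull, the version counter, and the file paths are all hypothetical, chosen only to show workers in the same process reading a shared map that stores references to file locations rather than matrix data.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch, not SystemML code: an in-process parameter server
// that maps keys to lightweight matrix references instead of matrix data.
final class LocalParamServer {

    // Hypothetical stand-in for a cache-managed matrix handle: it records
    // only where the data lives and which version it is, never the values.
    static final class MatrixRef {
        final String filePath;
        final long version;

        MatrixRef(String filePath, long version) {
            this.filePath = filePath;
            this.version = version;
        }
    }

    // Thread-safe map, so worker threads in the same process can read it
    // directly while the control program publishes updates.
    private final ConcurrentMap<String, MatrixRef> params = new ConcurrentHashMap<>();

    // Publishes a new reference under the key. The version starts at 1 and
    // is incremented atomically on every subsequent push to the same key.
    MatrixRef push(String key, String filePath) {
        return params.compute(key, (k, old) ->
                new MatrixRef(filePath, old == null ? 1 : old.version + 1));
    }

    // Returns the current reference for the key, or null if none exists.
    MatrixRef pull(String key) {
        return params.get(key);
    }

    public static void main(String[] args) throws InterruptedException {
        LocalParamServer ps = new LocalParamServer();
        // The path below is a made-up example location.
        ps.push("worker-param-replica", "/tmp/params_v1.bin");

        // Two worker threads retrieve the parameter replica from the map.
        Thread[] workers = new Thread[2];
        for (int i = 0; i < workers.length; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                MatrixRef ref = ps.pull("worker-param-replica");
                System.out.println("worker " + id + " sees " + ref.filePath
                        + " (version " + ref.version + ")");
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join();
        }
    }
}
```

For the Spark backend described above, the same map would presumably sit behind a server thread and be reached over a TCP socket; that part is not sketched here.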