GitHub user sachingoel0101 opened a pull request:
https://github.com/apache/flink/pull/1003
Parameter Server: Distributed Key-Value store, using Akka messaging
This PR adds a basic parameter server implementation built using Akka
messaging.
*Interface:*
Access: Inside `RuntimeContext`
Operations:
1. `registerBatch(Key, Value)` and `registerAsync(Key, Value)`, which register
a key-value pair under the Batch or Async update strategy respectively.
2. `updateParameter(key, value)`
3. `fetchParameter(key)`
With the `Batch` strategy, updates are applied only after all registered
clients have sent their updates.
With the `Async` strategy, every update is applied immediately.
[I intend to support the Stale Synchronous strategy too, but I want to get
feedback on the overall design of the work so far first.]
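To make the difference between the two strategies concrete, here is a minimal in-memory sketch. It is not the code in this PR: the names `StrategySketch`, `Parameter` and `Strategy` are hypothetical, the value is assumed to be a `double`, and an update is assumed to be an additive delta. It also counts updates rather than distinct clients, which a real server would have to track.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the Batch and Async strategies; not the PR's classes.
final class StrategySketch {
    enum Strategy { BATCH, ASYNC }

    static final class Parameter {
        private final Strategy strategy;
        private final int registeredClients;
        private final List<Double> pending = new ArrayList<>();
        private double value;

        Parameter(Strategy strategy, int registeredClients, double initial) {
            this.strategy = strategy;
            this.registeredClients = registeredClients;
            this.value = initial;
        }

        // Assumption: an update is a delta that is added to the stored value.
        synchronized void update(double delta) {
            if (strategy == Strategy.ASYNC) {
                value += delta; // Async: apply immediately
                return;
            }
            pending.add(delta); // Batch: buffer until every client has reported
            if (pending.size() == registeredClients) {
                for (double d : pending) {
                    value += d;
                }
                pending.clear();
            }
        }

        synchronized double fetch() {
            return value;
        }
    }

    public static void main(String[] args) {
        Parameter batch = new Parameter(Strategy.BATCH, 2, 0.0);
        batch.update(0.5);
        System.out.println(batch.fetch()); // still the initial value
        batch.update(0.25);
        System.out.println(batch.fetch()); // both buffered updates applied
    }
}
```

In this sketch a fetch under `Batch` keeps returning the old value until the last registered client has sent its update, whereas under `Async` every fetch sees all updates sent so far.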
Messaging and Actor systems:
1. A server is started within the actor system of every task manager.
2. All servers register with the Job Manager.
3. Job Manager determines on which server a particular key-value pair will
be stored and lets every registered server know as part of a regular heartbeat
message.
4. `RuntimeContext` uses `Patterns.ask` and `Await` with an infinite
timeout to query the server on its own task manager.
5. Servers forward their clients' requests to the appropriate server based
on the key.
6. After performing the desired operations, results are sent back to the
originating server, which forwards them back to the client.
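The routing in steps 3, 5 and 6 can be sketched roughly as follows. This is only an illustration: direct method calls stand in for Akka messages, the names `RoutingSketch`, `Server` and `ownerOf` are hypothetical, and assigning keys by hash modulo the number of servers is an assumption, since the description above does not say how the Job Manager picks a server.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of key-based forwarding; not the PR's classes.
final class RoutingSketch {

    // Stand-in for the Job Manager's decision (assumed: hash modulo server count).
    static int ownerOf(String key, int numServers) {
        return Math.floorMod(key.hashCode(), numServers);
    }

    static final class Server {
        final int id;
        final Map<String, Double> store = new HashMap<>();
        Server[] peers; // every server, including this one, indexed by id

        Server(int id) {
            this.id = id;
        }

        // A client always talks to its local server, which forwards if needed.
        void register(String key, double value) {
            int owner = ownerOf(key, peers.length);
            if (owner == id) {
                store.put(key, value);
            } else {
                peers[owner].register(key, value);
            }
        }

        double fetch(String key) {
            int owner = ownerOf(key, peers.length);
            if (owner != id) {
                // The result travels back through this (originating) server.
                return peers[owner].fetch(key);
            }
            Double value = store.get(key);
            if (value == null) {
                throw new IllegalStateException("unregistered key: " + key);
            }
            return value;
        }
    }

    static Server[] cluster(int n) {
        Server[] servers = new Server[n];
        for (int i = 0; i < n; i++) {
            servers[i] = new Server(i);
        }
        for (Server s : servers) {
            s.peers = servers;
        }
        return servers;
    }

    public static void main(String[] args) {
        Server[] servers = cluster(3);
        servers[0].register("weights", 1.0);
        System.out.println(servers[2].fetch("weights")); // same value from any server
    }
}
```

Whichever server a client contacts, the pair ends up stored on exactly one server, and every other server only relays requests and results.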
* There is scope for adding fault tolerance by maintaining two servers for
every key-value pair.
* If a task has to restart, it can resume from a different iteration number
(by accessing a clock value) without resending all earlier updates, because
every update is guaranteed to have been applied to the value.
* There are two examples demonstrating the usage of the Batch and Async
strategies in the Java examples package.
* The Travis build doesn't pass yet, since there are some new messages from
`ParameterServer` which still need to be handled in the unit tests.
I would love feedback about the overall design.
This is mostly inspired by the machine learning parameter server
perspective, and doesn't touch the actual internals of iterative tasks.
Instead, all operations on the parameter server are blocking and have an
infinite timeout. However, failure handling is built into the server itself:
an operation that does not succeed is retried up to a maximum number of times
before it is reported as failed.
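The combination of a blocking call without a timeout and a bounded number of retries on the server side might look roughly like the sketch below. It uses `CompletableFuture` from the JDK as a stand-in for `Patterns.ask` and `Await`, and the names `RetrySketch`, `withRetries` and `blockingRequest` are hypothetical, so treat it as an illustration of the idea rather than the code in this PR.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical illustration of blocking requests with server-side retries.
final class RetrySketch {

    // Server side: run the operation, retrying up to maxRetries times on failure.
    static <T> T withRetries(Supplier<T> operation, int maxRetries) {
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw new IllegalStateException(
                "operation failed after " + maxRetries + " retries", last);
    }

    // Client side: send the request and block with no timeout until a reply arrives.
    static <T> T blockingRequest(Supplier<T> operation, int maxRetries) {
        CompletableFuture<T> reply =
                CompletableFuture.supplyAsync(() -> withRetries(operation, maxRetries));
        return reply.join(); // throws CompletionException if the server gave up
    }

    public static void main(String[] args) {
        int[] calls = {0};
        Supplier<Integer> flaky = () -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("transient failure");
            }
            return 42;
        };
        System.out.println(blockingRequest(flaky, 3));
    }
}
```

The point of the design is that the client never hangs forever on a failing operation even though it waits without a timeout, because the server always answers eventually, either with a result or with a failure once the retries are used up.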
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sachingoel0101/flink flink_server
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1003.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1003
----
commit 539319a30e8d2d841bf1deb1b78de4c4f404dfb1
Author: Sachin Goel <[email protected]>
Date: 2015-08-05T04:04:38Z
Reorganized the server framework to runtime itself. Added asynchronous
blocking calls to the server.
commit 06daf1630afba005401826a68a646859383f4f9c
Author: Sachin Goel <[email protected]>
Date: 2015-08-09T00:29:30Z
Integrating into the Flink Runtime
commit 22933416e57cd63331a93134f56c44bea0395b24
Author: Sachin Goel <[email protected]>
Date: 2015-08-10T01:25:54Z
Added a server for each task manager
commit f3f524b9ecfbeba876c8a5946ac3b491df1b0b6b
Author: Sachin Goel <[email protected]>
Date: 2015-08-10T02:32:32Z
Added JobManager as a manager of servers
commit 988309ac4bbfb928b32b08fcecfab2e03d3a3feb
Author: Sachin Goel <[email protected]>
Date: 2015-08-10T04:15:08Z
Added a parameter store
commit 51c31078482e7c77de8beb90030094ea81a95384
Author: Sachin Goel <[email protected]>
Date: 2015-08-10T10:32:10Z
Added examples to demonstrate the usage
----