Very interesting addition, Sachin! I cc-ed Nam-Luc Tran, who recently opened a pull request with a draft for SSP (stale synchronous parallel) iterations in Flink, which usually work closely together with a parameter server.
I think that a parameter server is an orthogonal piece to Flink and should be decoupled from the runtime. But it would be great to have some common tooling and abstraction for this:

- Sachin's parameter server is built on Akka
- Nam-Luc used Apache Ignite for distributed key/value storage
- There are several other dedicated parameter server projects (such as https://github.com/dmlc/parameter_server)

How about creating a common abstraction with an interface that supports startup of the distributed model store, get/update, staleness, ...? The different technologies would then be pluggable behind this interface.

Since this is largely decoupled from Flink, and probably involves a few people, it might even make sense to create a dedicated GitHub project and later add it as a whole to Flink (or keep it independent, whatever works better).

What do you think?

Greetings,
Stephan

On Sat, Aug 8, 2015 at 4:04 AM, Sachin Goel <sachingoel0...@gmail.com> wrote:
> Hi all
> I've been working on a parameter server for Flink and I have some basic
> functionality up, which supports registering a key, plus setting, updating,
> and fetching a parameter.
>
> The idea is to start a parameter server somewhere and pass its address and
> port in a configuration for another actor system to access it. Right now, I
> have a standalone module for this, named flink-server, under staging.
>
> There is a {{ParameterClient}} which allows users to perform the above
> operations in a blocking fashion by waiting on a result from the server.
> You can look at the code here:
> https://github.com/apache/flink/compare/master...sachingoel0101:parameter_server
> [It is heavily derived from the JobManager implementation.]
>
> One obvious next step is to run several servers which can serve data to
> users. This can also provide redundancy, by replicating data across several
> servers and keeping them synchronized.
>
> 1.
> We can follow a slave model, where starting a server anywhere starts a
> server on all slave machines. After this, I plan to place each key-value
> pair on several machines by computing hashes [of the key's and the servers'
> UUIDs] modulo #servers. This way, every server knows exactly where all the
> keys reside. This, however, is problematic at failure time: if a server
> fails, we need to recompute the modulo values and redistribute almost all
> of the data to maintain redundancy.
>
> 2. Another method: for every TaskManager started, one server should be
> started inside the same actor system, and this server will handle all data
> transfers for the tasks running inside that particular TaskManager. This
> way, whenever a machine fails, the JobManager at least knows about it and
> can notify the other TaskManagers and their servers of the failure of
> their fellow server.
> Since the JobManager maintains a list of the servers/TaskManagers, we can
> maintain an indexed list of servers very easily. Then it's just a matter of
> mapping a key to an index in the JobManager's instance list.
> [Of course, I'm assuming it would be hard to assign indexes to servers in a
> standalone fashion such that everyone has the exact same view.]
>
> I'm more in favor of 2, since we ultimately need to use this in iterative
> algorithms, and that will require integration with the TaskManager and
> runtime context anyway. Plus, having a master to control everything makes
> everything easy. :')
>
> What do you guys think?
>
> Cheers!
> Sachin
>
> PS. Sorry about the long email on a weekend.
>
> --
> Sachin Goel
> Computer Science, IIT Delhi
> m. +91-9871457685
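To make the common-abstraction idea above a bit more concrete, here is a rough sketch of what such a pluggable interface could look like in Java. All names here (`ParameterStore`, `LocalParameterStore`, `staleness`) are illustrative assumptions, not taken from either pull request; an Akka- or Ignite-backed implementation would plug in behind the same interface.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a pluggable parameter-server interface.
// Names are illustrative, not from any existing Flink PR.
interface ParameterStore<K, V> {
    void start();                 // bring up the distributed model store
    V get(K key);                 // fetch the current value for a key
    void update(K key, V value);  // push an update for a key
    int staleness(K key);         // e.g. clock lag, as needed for SSP iterations
    void shutdown();
}

// Trivial in-memory implementation, just to show the contract.
class LocalParameterStore<K, V> implements ParameterStore<K, V> {
    private final Map<K, V> data = new HashMap<>();

    public void start() { /* nothing to start locally */ }
    public V get(K key) { return data.get(key); }
    public void update(K key, V value) { data.put(key, value); }
    public int staleness(K key) { return 0; } // a local store is never stale
    public void shutdown() { data.clear(); }
}
```

The point of keeping `get`/`update`/`staleness` behind one interface is exactly Stephan's suggestion: algorithm code would depend only on the abstraction, while the backing technology (Akka, Ignite, an external parameter server) stays swappable.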
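The redistribution problem Sachin describes for option 1 can be illustrated with a small simulation: with plain hash-modulo placement, count how many keys change owner when the server count drops from n to n-1. The key names and the `floorMod` placement rule are illustrative assumptions, not Sachin's actual implementation.

```java
// Illustrates the failure-time redistribution problem of modulo placement
// (option 1): when one server out of n fails, most keys map to a new owner.
class ModuloPlacement {

    // Which server index owns this key, given numServers servers.
    // Math.floorMod avoids negative results from hashCode().
    static int owner(String key, int numServers) {
        return Math.floorMod(key.hashCode(), numServers);
    }

    // Count how many of numKeys synthetic keys change owner when the
    // server count changes from 'before' to 'after'.
    static int movedKeys(int numKeys, int before, int after) {
        int moved = 0;
        for (int i = 0; i < numKeys; i++) {
            String key = "param-" + i;
            if (owner(key, before) != owner(key, after)) {
                moved++;
            }
        }
        return moved;
    }
}
```

For roughly uniform hashes, going from 4 servers to 3 remaps on the order of three quarters of all keys, which is why systems in this situation often reach for consistent hashing instead. Option 2 sidesteps the issue differently: the JobManager already tracks the live TaskManagers, so it can own the key-to-server index mapping centrally.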