Github user sachingoel0101 commented on the pull request:
https://github.com/apache/flink/pull/1003#issuecomment-131477326
A stand-alone parameter server service would require setting up and tearing
down a client every time the user, say, opens and closes a Rich function
that uses it. Further, it would mean adding another dependency when the
same can be accomplished using Akka.
In this implementation, the parameter *server* lives at every task manager
[effectively it acts as a client serving all running tasks at one node. In
fact, in this sense, there are no servers, just clients at every worker, which
are managed by the Job Manager]. This in itself means less data transfer over
the network, since every *server* will usually be the owner of a key and can
serve its *clients* faster, instead of every request going over the network.
Further, it is completely distributed, and every task manager maintains its
own *server* and sets it up or tears it down along with itself.
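To make the ownership idea concrete, here is a hedged sketch (not the PR's actual classes; all names are illustrative): each worker's store owns a subset of keys and serves those locally, so only requests for non-owned keys would ever cross the network.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Illustrative sketch: each TaskManager hosts a store that owns a subset of
// keys; tasks on the same node read and update owned keys in local memory.
class ParameterStore {
    private final Set<String> ownedKeys;
    private final Map<String, double[]> params = new HashMap<>();

    ParameterStore(Set<String> ownedKeys) {
        this.ownedKeys = ownedKeys;
    }

    boolean owns(String key) {
        return ownedKeys.contains(key);
    }

    // Local fast path; a non-owned key would be fetched from the owning
    // worker over the network (omitted in this sketch).
    Optional<double[]> get(String key) {
        return owns(key) ? Optional.ofNullable(params.get(key)) : Optional.empty();
    }

    void update(String key, double[] delta) {
        if (!owns(key)) throw new IllegalArgumentException("owned by another worker");
        double[] cur = params.computeIfAbsent(key, k -> new double[delta.length]);
        for (int i = 0; i < cur.length; i++) cur[i] += delta[i];
    }
}
```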
As far as including it in the core itself is concerned, there isn't much to
it: just the three or four functions added directly to the runtime context,
which effectively serve as an interface.
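As a rough sketch of what that handful of runtime-context functions might look like (the method names here are illustrative, not the PR's actual API), with a trivial in-memory implementation to show the intended call pattern:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface for the three or four runtime-context functions.
interface ParameterServerContext {
    void registerParameter(String key, double[] initial);
    double[] getParameter(String key);
    void updateParameter(String key, double[] delta);
}

// Trivial local-map implementation, standing in for the Akka-backed one.
class LocalParameterServerContext implements ParameterServerContext {
    private final Map<String, double[]> store = new HashMap<>();

    public void registerParameter(String key, double[] initial) {
        store.putIfAbsent(key, initial.clone()); // re-registering is a no-op
    }

    public double[] getParameter(String key) {
        return store.get(key).clone();
    }

    public void updateParameter(String key, double[] delta) {
        double[] v = store.get(key);
        for (int i = 0; i < v.length; i++) v[i] += delta[i];
    }
}
```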
@tillrohrmann, could you weigh in here on whether this is the intended use
of a PS in ML algorithms? I can easily see this working with, for example,
the regression algorithm.
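For the regression case, the usage pattern I have in mind could be sketched like this (hedged: the PS is stubbed with a local map, and all names are illustrative, not the PR's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of least-squares regression via parameter-server get/update calls.
class RegressionWithPS {
    private final Map<String, double[]> ps = new HashMap<>(); // PS stub

    RegressionWithPS(int dim) {
        ps.put("weights", new double[dim]);
    }

    double[] get(String key) {
        return ps.get(key).clone();
    }

    void update(String key, double[] delta) {
        double[] w = ps.get(key);
        for (int i = 0; i < w.length; i++) w[i] += delta[i];
    }

    // One SGD pass: pull the current weights, compute the gradient on a
    // local example, push the scaled negative gradient back as a delta.
    void sgdPass(double[][] xs, double[] ys, double lr) {
        for (int n = 0; n < xs.length; n++) {
            double[] w = get("weights");
            double pred = 0.0;
            for (int i = 0; i < w.length; i++) pred += w[i] * xs[n][i];
            double err = pred - ys[n];
            double[] delta = new double[w.length];
            for (int i = 0; i < w.length; i++) delta[i] = -lr * err * xs[n][i];
            update("weights", delta);
        }
    }
}
```

Each task would only need the get/update pair from the runtime context; the pull-compute-push loop itself stays inside the user function.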
The reason I included it in the runtime is that there are now no extra
failure modes: if the TaskManager is alive, the Parameter Server at that
worker is alive. Further, the Job Manager manages the servers and determines
where each key will go [which will be crucial for recovery], something that
can be very hard to determine in a completely decentralized manner (I
couldn't think of a foolproof way). This ensures that the server runs only
on the workers where it's needed, and only if it is needed. Keeping the Job
Manager in the loop also makes recovery easy: if a Task Manager fails, the
Job Manager knows which server failed by matching the `InstanceID`s and can
kick off the recovery process from the duplicate server. [This is not
implemented yet.]
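The centralized placement could look something like the following sketch (hypothetical: names and the hash-based scheme are illustrative, not the PR's actual implementation): the Job Manager maps each key to a primary worker and a duplicate over the live TaskManager instance ids, and on failure promotes the duplicate.

```java
import java.util.List;

// Illustrative key-to-worker placement, decided centrally by the Job Manager.
class KeyPlacement {
    // Deterministic assignment: primary by hash, duplicate on the next worker.
    static String[] assign(String key, List<String> workers) {
        int i = Math.floorMod(key.hashCode(), workers.size());
        return new String[]{workers.get(i), workers.get((i + 1) % workers.size())};
    }

    // On failure, match the failed instance id and promote the duplicate.
    static String recover(String key, List<String> workers, String failed) {
        String[] pd = assign(key, workers);
        return pd[0].equals(failed) ? pd[1] : pd[0];
    }
}
```

Because the assignment is a pure function of the key and the worker list, the Job Manager can recompute it at any time instead of tracking placement state.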
A stand-alone PS would add another master-worker system running in parallel
to the JobManager-TaskManager system, which can itself be used efficiently
for this purpose. Of course, this doesn't matter if we use an external
key-value store.
I will have a look at #967 and see how the two can be integrated.
I had a look at an open implementation done for Spark:
https://github.com/apache/spark/compare/branch-1.3...chouqin:ps-on-spark-1.3
It adds a separate context and a function on RDD to access the PS, and it
does require running a service inside the core environment.