Github user sachingoel0101 commented on the pull request:

    https://github.com/apache/flink/pull/1003#issuecomment-131477326
  
    A stand-alone parameter server service would require setting up and tearing down the client every time the user, say, opens and closes a Rich function while using it. Further, it means adding another dependency when the same could be accomplished using Akka.
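    To make that overhead concrete, here is a minimal sketch of what a rich function talking to a stand-alone service would look like. `ExternalPsClient` and its methods are made up purely for illustration; they are not part of Flink or of this PR.
    ```java
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;

    // Stand-in client for an external parameter server, invented for this sketch;
    // a real client would wrap some RPC library.
    class ExternalPsClient implements AutoCloseable {
        static ExternalPsClient connect(String address) { return new ExternalPsClient(); }
        double[] pull(String key) { return new double[0]; }
        void push(String key, double[] update) { }
        @Override
        public void close() { }
    }

    // Against a stand-alone service, every rich function has to set up a
    // connection in open() and tear it down in close(), for every task,
    // every time the function is used.
    public class StandalonePsStep extends RichMapFunction<double[], double[]> {

        private transient ExternalPsClient client;

        @Override
        public void open(Configuration parameters) throws Exception {
            client = ExternalPsClient.connect("ps-host:9000");  // per-task setup
        }

        @Override
        public double[] map(double[] gradient) throws Exception {
            double[] weights = client.pull("weights");  // every call crosses the network
            client.push("weights", gradient);
            return weights;
        }

        @Override
        public void close() throws Exception {
            client.close();  // per-task teardown
        }
    }
    ```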
    In this implementation, the parameter *server* lives at every task manager [effectively it acts as a client serving all tasks running on that node; in this sense there are really no servers, just clients at every worker, which are managed by the Job Manager]. This in itself means less data transfer over the network, since every *server* will usually be the owner of a key and can serve its *clients* faster, instead of every request going over the network.
    Further, it is completely distributed: every task manager maintains its own *server* and sets it up or tears it down along with itself.
    As far as including it in the core itself is concerned, there isn't much to it: just the three or four functions added directly in the Runtime context, which effectively serve as an interface.
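    As a rough illustration of the kind of interface meant here, the accessors could look something like the sketch below. The names and signatures are assumptions for illustration only, not the actual methods added by this PR.
    ```java
    // Hypothetical accessor methods exposed through the runtime context.
    public interface ParameterStoreAccess {

        /** Register a key with an initial value at the local *server*. */
        <T> void registerParameter(String key, T initialValue);

        /** Fetch the current value of a key; served locally whenever this
         *  task manager owns the key. */
        <T> T getParameter(String key);

        /** Push an update for a key; the owning task manager merges it. */
        <T> void updateParameter(String key, T update);
    }
    ```
    With something along these lines on the runtime context, user functions never have to manage connections themselves.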
    @tillrohrmann, could you weigh in here on whether this is the intended use of a PS in ML algorithms? I can easily see this working with, for example, the regression algorithm.
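    For instance, a gradient step for linear regression could read and update the model through those accessors instead of broadcasting the full weight vector. The sketch below builds on the hypothetical `ParameterStoreAccess` interface above; how the accessor is obtained from the runtime context is likewise an assumption.
    ```java
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;

    // Sketch: one squared-loss gradient step per (features, label) example,
    // pulling and pushing the weight vector through the parameter store.
    public class RegressionGradientStep
            extends RichMapFunction<Tuple2<double[], Double>, double[]> {

        private transient ParameterStoreAccess store;

        @Override
        public void open(Configuration parameters) {
            // No connection to set up or tear down: the *server* lives inside the
            // task manager. The cast is an assumption about how the accessor is exposed.
            store = (ParameterStoreAccess) getRuntimeContext();
        }

        @Override
        public double[] map(Tuple2<double[], Double> example) {
            double[] features = example.f0;
            double label = example.f1;

            double[] weights = store.getParameter("weights");

            // Gradient of the squared loss for a single example.
            double prediction = 0.0;
            for (int i = 0; i < features.length; i++) {
                prediction += weights[i] * features[i];
            }
            double error = prediction - label;

            double[] gradient = new double[features.length];
            for (int i = 0; i < features.length; i++) {
                gradient[i] = error * features[i];
            }

            store.updateParameter("weights", gradient);
            return gradient;
        }
    }
    ```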
    The reason I included it in the runtime is that it introduces no new chance of failure: if the Task Manager is alive, the Parameter Server at that worker is alive. Further, the Job Manager manages the servers and determines where each key goes [which will be crucial for recovery], something that can be very hard to decide in a completely decentralized manner (I couldn't think of a foolproof way). This ensures that the server runs only on the workers where it is needed, and only if it is needed. Keeping the Job Manager in the loop also makes recovery easy: if a Task Manager fails, the Job Manager knows which server failed by matching the `InstanceID`s and can kick off the recovery process from the duplicate server. [This is not implemented yet.]
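    To illustrate the bookkeeping described above (key placement plus recovery by matching instances), here is a minimal sketch of what the Job Manager side could track. The class, the plain hash-based placement, and the use of strings as a stand-in for Flink's `InstanceID` are all assumptions, and, as said, recovery itself is not implemented yet.
    ```java
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical Job-Manager-side bookkeeping: which task manager owns each key,
    // which one holds a duplicate, and what to recover when an instance fails.
    public class ParameterPlacement {

        /** key -> instance id of the task manager that owns (serves) the key. */
        private final Map<String, String> primaryOwner = new HashMap<>();

        /** key -> instance id of the task manager that keeps a duplicate. */
        private final Map<String, String> backupOwner = new HashMap<>();

        /** Assign a key to one of the registered task managers (hash placement for the sketch). */
        public void place(String key, String[] registeredInstances) {
            int i = Math.abs(key.hashCode()) % registeredInstances.length;
            primaryOwner.put(key, registeredInstances[i]);
            backupOwner.put(key, registeredInstances[(i + 1) % registeredInstances.length]);
        }

        /** On task manager failure: the keys whose primary copy was lost,
         *  paired with the instance holding the duplicate to recover from. */
        public Map<String, String> keysToRecover(String failedInstance) {
            Map<String, String> recover = new HashMap<>();
            for (Map.Entry<String, String> e : primaryOwner.entrySet()) {
                if (e.getValue().equals(failedInstance)) {
                    recover.put(e.getKey(), backupOwner.get(e.getKey()));
                }
            }
            return recover;
        }
    }
    ```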
    A stand-alone PS would add another master-worker system running in parallel to the JobManager-TaskManager system, even though the latter can be used for this purpose efficiently. Of course, this doesn't matter if we use an external key-value store.
    I will have a look at #967 and see how the two can be integrated.
    
    I had a look at an open implementation for Spark:
    https://github.com/apache/spark/compare/branch-1.3...chouqin:ps-on-spark-1.3
    It adds a separate context and a function on RDDs to access the PS, and it does require running a service inside the core environment.

