Well, at a high level, we could emulate synchronous model parallelism via our existing parfor construct out of the box. If that is sufficient from an algorithm perspective, I would be in favor of making any necessary improvements there instead of introducing a new construct for parameter servers.
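
To make that concrete, something along these lines might already work today. This is just a rough, untested DML-style sketch; the random data, the squared-loss objective, and all names (X, y, W, k, lr) are made up for illustration and do not refer to an existing script. Each parfor worker owns a disjoint slice of the model and performs the synchronous update for that slice:

    # hypothetical sketch: synchronous, model-parallel updates emulated with parfor
    X = rand(rows=10000, cols=1000);        # stand-in for the training data
    y = rand(rows=10000, cols=1);
    k = 4;                                   # number of parallel workers
    lr = 0.1;                                # learning rate
    max_iter = 100;
    blk = ncol(X) / k;                       # parameters per worker (assumes k divides ncol(X))
    W = matrix(0, rows=ncol(X), cols=1);     # shared model

    for( iter in 1:max_iter ) {
      r = X %*% W - y;                       # shared residual (squared loss)
      parfor( j in 1:k ) {                   # each worker owns one slice of the model
        lo = (j-1) * blk + 1;
        hi = j * blk;
        G = t(X[, lo:hi]) %*% r;             # partial gradient for the owned slice
        W[lo:hi, ] = W[lo:hi, ] - lr * G / nrow(X);  # synchronous update of that slice
      }
    }

The synchronization point is simply the end of the parfor body, so every outer iteration sees the fully updated model from the previous iteration.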
There are a couple of reasons for that. First, given the variety of backends and potential execution plans, it is usually hard to integrate such a construct well with the rest of the system. Second, a custom parameter server would need to be integrated either with Spark or, if implemented from scratch, with a number of different cluster resource managers (e.g., YARN, Mesos, Kubernetes). Third, extending the existing parfor construct as necessary would potentially also benefit other scripts.

Asynchronous model parallelism might also be possible to integrate into parfor. I remember discussions on state exchange between parfor workers (e.g., for KMeans, to find out whether at least one run has already converged). Maybe this is a good time to introduce such a mechanism, which would allow the update and broadcast of models in this context.

Regards,
Matthias

On Sun, Jun 18, 2017 at 10:16 PM, Janardhan Pulivarthi <
[email protected]> wrote:

> Dear committers,
>
> Implementation/integration of an existing parameter server for the
> execution of algorithms in a distributed fashion, both for machine
> learning and deep learning.
>
> The following document covers a bit about whether we need one or not.
>
> My name is Janardhan, and I am currently working on [SYSTEMML-1437], the
> implementation of factorization machines, which are to be sparse-safe and
> scalable. To stick to this philosophy, we might need a model parallel
> construct. I know very little about how SystemML exactly works. If you
> find some *7 minutes*, please have a look at this doc.
>
> Parameter Server: a model parallel construct
> <https://docs.google.com/document/d/1AOW53numMJSF_msGvo1lekpyv7_
> 3VF51i6xAjNCEC9I/edit?usp=drive_web>
>
>
