Yes, I agree. Jupyter was great for me because I was experimenting with "model parallelism" and wanted to iterate quickly, but it ends up making the code you write look significantly different from the single-process case, and the common case for distributed training is data parallelism.
The annoying part of this workflow is how to distribute the model script. So I think the appropriate strategy is asking for a "training script PVC" that can be mounted read-only on many workers, and then the workers just expect to run train.py from the volume.

I'm hoping to return to this this weekend.

-Eli

On Jul 15, 2016 4:08 AM, "Toby Chan" <[email protected]> wrote:

> Great. I have looked at this and it works.
>
> But I'm not sure if it's a good way for users to deploy TensorFlow
> applications with Jupyter. Sometimes they want to embed the client within
> the worker or access the local file system. The Kubernetes yaml file looks
> a little complicated and I hope we can integrate with them as a cloud
> project.
>
> Thanks again for all your help.
>
> On Friday, July 15, 2016 at 4:51:46 AM UTC+8, Eli Bixby wrote:
>>
>> Hey Toby, author of that example here. I used a little jinja config and
>> render script to get 1 RC + 1 Service for each worker, since the workers
>> need to be individually addressable. You can adjust the number of workers
>> and parameter servers by changing example-cluster.yaml and running the
>> render.py script on it.
>>
>> I'd like to publish a Helm Chart, which I'll later update to use PetSets
>> (once they are out of Alpha).
>>
>> Also, there are some updates I need to push out to that repository:
>> namely, all of the parameter servers need to have a shared volume mounted RW
>> in order for checkpointing to work, which is necessary for training jobs
>> to survive pod restarts.
>>
>> On Thursday, July 14, 2016 at 1:38:31 PM UTC-7, Alex Robinson wrote:
>>>
>>> There's an example here if you'd like to give it a try:
>>> https://github.com/amygdala/tensorflow-workshop/tree/master/workshop_sections/distributed_tensorflow
>>>
>>> On Wednesday, July 13, 2016 at 7:16:17 PM UTC-7, Toby Chan wrote:
>>>>
>>>> Thanks for all your replies. There is a template from David's links and
>>>> I would like to try it out. Most blogs about Kubernetes and TensorFlow are
>>>> about dockerized TensorFlow Serving, which is different from TensorFlow
>>>> itself.
>>>>
>>>> I will update this topic later if I find the proper way to configure the
>>>> cluster spec for distributed TensorFlow with a template or something else.
>>>> Thanks again.
>>>>
>>>> On Thursday, July 14, 2016 at 3:51:43 AM UTC+8, David Oppenheimer wrote:
>>>>>
>>>>> Looks like this is the blog post
>>>>>
>>>>> http://blog.kubernetes.io/2016/03/scaling-neural-network-image-classification-using-Kubernetes-with-TensorFlow-Serving.html
>>>>>
>>>>> A Google search turned up a few other seemingly relevant links
>>>>>
>>>>> http://stackoverflow.com/questions/37720799/how-do-you-run-distributed-tensorflow-on-gke
>>>>>
>>>>> https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/README.md
>>>>>
>>>>> https://tensorflow.github.io/serving/serving_inception.html#part_2_deploy_in_kubernetes
>>>>>
>>>>>
>>>>> On Wed, Jul 13, 2016 at 6:20 AM, Rodrigo Campos <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> You can use a ConfigMap for that, or even just define env values in the
>>>>>> pod spec.
>>>>>>
>>>>>> Also, I think (I might be wrong) that one of the blog posts about
>>>>>> this had examples. But I don't have them handy now; they were probably
>>>>>> posted on the Google Cloud Platform blog.
>>>>>>
>>>>>>
>>>>>> On Wednesday, July 13, 2016, Toby Chan <[email protected]> wrote:
>>>>>>
>>>>>>> We have found that Google recommends Kubernetes to run distributed
>>>>>>> TensorFlow. The container should be immutable, but how can we define the
>>>>>>> configuration? For TensorFlow, we need to specify the cluster_spec by
>>>>>>> parameters or environment variables.
>>>>>>>
>>>>>>> What's the recommended way to run TensorFlow in Kubernetes? Does
>>>>>>> anyone have experience or an example for this?
>>>>>>>
>>>>>>>
>>>>>>> Thanks in advance. Regards.
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "Containers at Google" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to
>>>>>>> [email protected].
>>>>>>> Visit this group at
>>>>>>> https://groups.google.com/group/google-containers.
>>>>>>> For more options, visit https://groups.google.com/d/optout.
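
For anyone landing on this thread with the original question (keeping the container immutable while still telling each pod about the cluster), here is a minimal sketch of the "env values in the pod spec" approach. It only uses the Python standard library. The variable names PS_HOSTS, WORKER_HOSTS, JOB_NAME and TASK_INDEX, and the function name, are assumptions made up for this illustration; they are not defined by Kubernetes or TensorFlow, and this is not taken from the example repository linked in the thread.

```python
import os


def cluster_spec_from_env(environ=None):
    """Build a {"ps": [...], "worker": [...]} dict from environment variables.

    PS_HOSTS, WORKER_HOSTS, JOB_NAME and TASK_INDEX are hypothetical names
    chosen for this sketch; set them in the pod spec or from a ConfigMap.
    """
    env = os.environ if environ is None else environ

    def split_hosts(value):
        # Accept "host1:2222,host2:2222"; drop whitespace and empty entries.
        return [h.strip() for h in value.split(",") if h.strip()]

    cluster = {
        "ps": split_hosts(env.get("PS_HOSTS", "")),
        "worker": split_hosts(env.get("WORKER_HOSTS", "")),
    }
    job_name = env.get("JOB_NAME", "worker")
    task_index = int(env.get("TASK_INDEX", "0"))

    # Fail early if this pod's identity does not match the declared cluster.
    if job_name not in cluster:
        raise ValueError("unknown job name: %r" % job_name)
    if not 0 <= task_index < len(cluster[job_name]):
        raise ValueError(
            "task index %d out of range for job %r" % (task_index, job_name)
        )
    return cluster, job_name, task_index


if __name__ == "__main__":
    # Example values only; in a pod these would come from the environment.
    example_env = {
        "PS_HOSTS": "ps-0:2222",
        "WORKER_HOSTS": "worker-0:2222,worker-1:2222",
        "JOB_NAME": "worker",
        "TASK_INDEX": "1",
    }
    print(cluster_spec_from_env(example_env))
```

The returned dict maps job names to lists of host:port addresses, which is the shape tf.train.ClusterSpec accepts, so train.py could pass it straight through. The host names would be the per-worker Service names, since each worker needs to be individually addressable.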
