Forgot to add Xun in my last email.

On Thu, Nov 8, 2018 at 11:55 AM Wangda Tan <wheele...@gmail.com> wrote:
> Hi Robert,
>
> Submarine in 3.2.0 only supports the Docker container runtime. In future
> releases (maybe 3.2.1), we plan to add support for non-Docker containers.
>
> In order to try Submarine, you need to properly configure docker-on-yarn
> first.
>
> You can check
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptEN.md
> for an installation guide on how to properly set up Docker container on
> multiple containers. Submarine embeds an interactive shell to help you
> set this up, so it should be straightforward. Added Xun Liu, who is the
> original author of the installation interactive shell.
>
> Once you get Docker on YARN properly set up, you can follow
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/QuickStart.md
> to run the first application.
>
> Also, you can check the Submarine slides to better understand how it works.
> See: https://www.dropbox.com/s/wuv19b3rt9k2kq6/submarine-v0.pptx?dl=0
>
> If you have any questions, please don't hesitate to let us know.
>
> Thanks,
> Wangda
>
> On Thu, Nov 8, 2018 at 10:12 AM Robert Grandl <rgra...@yahoo.com.invalid>
> wrote:
>
>> Thanks a lot for your reply.
>> Sunil,
>> I was trying to follow the steps from:
>> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/RunningDistributedCifar10TFJobs.md
>> to run the tensorflow standalone using submarine. I have installed hadoop
>> 3.3.0-SNAPSHOT.
>> However, when I run the command:
>>
>>   yarn jar path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
>>     job run --name tf-job-001 --verbose --docker_image hadoopsubmarine/tf-1.8.0-gpu:0.0.1 \
>>     --input_path hdfs://default/dataset/cifar-10-data \
>>     --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
>>     --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
>>     --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 \
>>     --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
>>     --tensorboard --tensorboard_docker_image wtan/tf-1.8.0-cpu:0.0.3
>>
>> I get the following error:
>>
>>   2018-11-07 21:48:55,831 INFO [main] client.AHSProxy (AHSProxy.java:createAHSProxy(42)) - Connecting to Application History server at /128.105.144.236:10200
>>   Exception in thread "main" java.lang.IllegalArgumentException: Unacceptable no of cpus specified, either zero or negative for component master (or at the global level)
>>     at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateServiceResource(ServiceApiUtil.java:457)
>>     at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateComponent(ServiceApiUtil.java:306)
>>     at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:237)
>>     at org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:496)
>>     at org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
>>     at org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
>>     at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>     at java.lang.reflect.Method.invoke(Method.java:498)
>>     at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
>>
>> It seems that I have not configured the corresponding resources for a
>> master component somewhere. However, I have a hard time understanding
>> where and what to configure. I also looked at the design document you
>> pointed at:
>> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7
>> and it has a --master_resources flag. However, this is not available in
>> 3.3.0.
>> Could you please advise how to proceed with this?
>> Thank you,
>> - Robert
>>
>> On Tuesday, November 6, 2018, 10:40:20 PM PST, Jonathan Hung
>> <jyhung2...@gmail.com> wrote:
>>
>> Hi Robert, I also encourage you to check out
>> https://github.com/linkedin/TonY (TensorFlow on YARN), which is a
>> platform built for this purpose.
>>
>> Jonathan
>> ________________________________
>> From: Sunil G <sun...@apache.org>
>> Sent: Tuesday, November 6, 2018 10:05:14 PM
>> To: Robert Grandl
>> Cc: yarn-...@hadoop.apache.org; yarn-dev-h...@hadoop.apache.org; General
>> Subject: Re: Run Distributed TensorFlow on YARN
>>
>> Hi Robert
>>
>> The {Submarine} project helps to run distributed TensorFlow on top of
>> YARN with ease. YARN-8220 <https://issues.apache.org/jira/browse/YARN-8220>
>> was an early attempt to do the same with some scripts etc. Submarine
>> avoids all such custom scripts: you can simply run TensorFlow like a
>> distributed shell command line by using the Submarine jar. Please refer
>> to the doc below for a deep dive.
>>
>> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7
>>
>> Submarine will be released as part of the Hadoop 3.2.0 release, which
>> will be officially out very soon (in the coming weeks).
>> You are free to use Hadoop trunk to run the same if you need it sooner.
>>
>> For now you can refer to the Submarine docs in the Hadoop repo (trunk)
>> under
>> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/
>> or
>> (https://github.com/apache/hadoop/tree/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown)
>>
>> Thanks
>> Sunil
>>
>> On Wed, Nov 7, 2018 at 10:34 AM Robert Grandl <rgra...@yahoo.com.invalid>
>> wrote:
>>
>> > Hi all,
>> > I am wondering if there is any stable support to run distributed
>> > TensorFlow atop YARN at the moment.
>> > I found this blog post from Hortonworks. It seems this is possible
>> > starting with YARN 3.1.0.
>> > https://hortonworks.com/blog/distributed-tensorflow-assembly-hadoop-yarn/
>> >
>> > Also I found some more recent JIRAs:
>> > https://issues.apache.org/jira/browse/YARN-8220
>> > https://issues.apache.org/jira/browse/YARN-8135
>> > which suggest using something called Submarine.
>> >
>> > However, I could not find any proper documentation or instructions to
>> > use any of these.
>> >
>> > Can someone help me with this?
>> > Otherwise, is there any better support to run any other machine
>> > learning framework with YARN?
>> > Thank you in advance,
>> > - Robert