This is an automated email from the ASF dual-hosted git repository.
sunilg pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/hadoop.git
The following commit(s) were added to refs/heads/trunk by this push:
new de01422 SUBMARINE-56. Update documentation to describe single-node
PyTorch integration. Contributed by Szilard Nemeth.
de01422 is described below
commit de01422c2e13747686028e5c0e3a06306e9e5f08
Author: Sunil G
AuthorDate: Wed May 15 21:26:48 2019 -0700
SUBMARINE-56. Update documentation to describe single-node PyTorch
integration. Contributed by Szilard Nemeth.
---
.../src/site/markdown/DeveloperGuide.md | 2 +-
.../src/site/markdown/Examples.md | 2 +
.../src/site/markdown/Index.md | 6 +-
.../src/site/markdown/InstallationGuide.md | 2 +-
.../markdown/InstallationGuideChineseVersion.md | 2 +-
.../src/site/markdown/QuickStart.md | 13 ++-
.../markdown/RunningDistributedCifar10TFJobs.md | 8 +-
.../markdown/RunningSingleNodeCifar10PTJobs.md | 62 +++++++++++
.../src/site/markdown/WriteDockerfilePT.md | 114 +++++++++++++++++++++
.../{WriteDockerfile.md => WriteDockerfileTF.md} | 8 +-
10 files changed, 203 insertions(+), 16 deletions(-)
diff --git
a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/DeveloperGuide.md
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/DeveloperGuide.md
index 76e3ae0..9ab0641 100644
--- a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/DeveloperGuide.md
+++ b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/DeveloperGuide.md
@@ -14,7 +14,7 @@
# Developer Guide
-By default, submarine uses YARN service framework as runtime. If you want to
add your own implementation. You can add a new `RuntimeFactory` implementation
and configure following option to `submarine.xml` (which should be placed under
same `$HADOOP_CONF_DIR`)
+By default, Submarine uses YARN service framework as runtime. If you want to
add your own implementation, you can add a new `RuntimeFactory` implementation
and configure following option to `submarine.xml` (which should be placed under
same `$HADOOP_CONF_DIR`)
```
<property>
diff --git
a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Examples.md
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Examples.md
index 3e7f02f..d878add 100644
--- a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Examples.md
+++ b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Examples.md
@@ -18,4 +18,6 @@ Here're some examples about Submarine usage.
[Running Distributed CIFAR 10 Tensorflow
Job](RunningDistributedCifar10TFJobs.html)
+[Running Standalone CIFAR 10 PyTorch Job](RunningSingleNodeCifar10PTJobs.html)
+
[Running Zeppelin Notebook on YARN](RunningZeppelinOnYARN.html)
\ No newline at end of file
diff --git a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Index.md
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Index.md
index baeaa15..f8556a6 100644
--- a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Index.md
+++ b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Index.md
@@ -22,6 +22,8 @@ Goals of Submarine:
- Support run distributed Tensorflow jobs with simple configs.
+- Support run standalone PyTorch jobs with simple configs.
+
- Support run user-specified Docker images.
- Support specify GPU and other resources.
@@ -37,7 +39,9 @@ Click below contents if you want to understand more.
- [Examples](Examples.html)
-- [How to write Dockerfile for Submarine jobs](WriteDockerfile.html)
+- [How to write Dockerfile for Submarine TensorFlow
jobs](WriteDockerfileTF.html)
+
+- [How to write Dockerfile for Submarine PyTorch jobs](WriteDockerfilePT.html)
- [Developer guide](DeveloperGuide.html)
diff --git
a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuide.md
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuide.md
index 1c7812b..e73887e 100644
---
a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuide.md
+++
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuide.md
@@ -304,7 +304,7 @@ https://github.com/NVIDIA/nvidia-docker
### Tensorflow Image
-There is no need to install CUDNN and CUDA on the servers, because CUDNN and
CUDA can be added in the docker images. we can get basic docker images by
referring to WriteDockerfile.md.
+There is no need to install CUDNN and CUDA on the servers, because CUDNN and
CUDA can be added in the docker images. We can get basic docker images by
referring to [Write Dockerfile](WriteDockerfileTF.html).
### Test tensorflow in a docker container
diff --git
a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuideChineseVersion.md
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuideChineseVersion.md
index ba996e8..7667c1c 100644
---
a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuideChineseVersion.md
+++
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuideChineseVersion.md
@@ -293,7 +293,7 @@ https://github.com/NVIDIA/nvidia-docker
### Tensorflow Image
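-CUDNN and CUDA do not actually need to be installed on the physical machines, because Sumbmarine provides image files that already include CUDNN and CUDA;
for the basic Dockerfile, see WriteDockerfile.md
+CUDNN and CUDA do not actually need to be installed on the physical machines, because Submarine provides image files that already include CUDNN and CUDA;
for the basic Dockerfile, see WriteDockerfileTF.md
### Test the TF environment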
diff --git
a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/QuickStart.md
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/QuickStart.md
index 3b68f51..5648c11 100644
--- a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/QuickStart.md
+++ b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/QuickStart.md
@@ -24,15 +24,18 @@ Optional:
- Enable YARN DNS. (When yarn service runtime is required.)
- Enable GPU on YARN support. (When GPU-based training is required.)
-- Docker images for submarine jobs. (When docker container is required.)
+- Docker images for Submarine jobs. (When docker container is required.)
```
# Get prebuilt docker images (No liability)
docker pull hadoopsubmarine/tf-1.13.1-gpu:0.0.1
# Or build your own docker images
docker build . -f Dockerfile.gpu.tf_1.13.1 -t tf-1.13.1-gpu-base:0.0.1
```
-More details, please refer to
-[How to write Dockerfile for Submarine jobs](WriteDockerfile.html)
+For more details, please refer to:
+
+- [How to write Dockerfile for Submarine TensorFlow
jobs](WriteDockerfileTF.html)
+
+- [How to write Dockerfile for Submarine PyTorch jobs](WriteDockerfilePT.html)
## Run jobs
@@ -120,7 +123,7 @@ reported from `entry_script.py`.
### Submarine Configuration
-For submarine internal configuration, please create a `submarine.xml` which
should be placed under `$HADOOP_CONF_DIR`.
+For Submarine internal configuration, please create a `submarine.xml` which
should be placed under `$HADOOP_CONF_DIR`.
|Configuration Name | Description |
|:---- |:---- |
@@ -235,7 +238,7 @@ Or you can use `yarn logs -applicationId <applicationId>`
to get logs from CLI
## Build from source code
-If you want to build submarine project by yourself, you can follow the steps:
+If you want to build the Submarine project by yourself, you can follow the
steps:
- Run 'mvn install -DskipTests' from Hadoop source top level once.
diff --git
a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/RunningDistributedCifar10TFJobs.md
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/RunningDistributedCifar10TFJobs.md
index 7da98d5..c0cf088 100644
---
a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/RunningDistributedCifar10TFJobs.md
+++
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/RunningDistributedCifar10TFJobs.md
@@ -39,9 +39,9 @@ python generate_cifar10_tfrecords.py --data-dir=cifar-10-data
hadoop fs -put cifar-10-data/ /dataset/cifar-10-data
```
-**Please note that:**
+**Warning:**
-YARN service doesn't allow multiple services with the same name, so please run
following command
+Please note that YARN service doesn't allow multiple services with the same
name, so please run the following command
```
yarn application -destroy <service-name>
```
@@ -49,7 +49,7 @@ to delete services if you want to reuse the same service name.
## Prepare Docker images
-Refer to [Write Dockerfile](WriteDockerfile.md) to build a Docker image or use
prebuilt one.
+Refer to [Write Dockerfile](WriteDockerfileTF.html) to build a Docker image or
use a prebuilt one.
## Run Tensorflow jobs
@@ -92,6 +92,8 @@ Explanations:
- `>1` num_workers indicates it is a distributed training.
- Parameters / resources / Docker image of parameter server can be specified
separately. For many cases, parameter server doesn't require GPU.
+For the meaning of the individual parameters, see the
[QuickStart](QuickStart.html) page!
+
*Outputs of distributed training*
Sample output of master:
diff --git
a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/RunningSingleNodeCifar10PTJobs.md
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/RunningSingleNodeCifar10PTJobs.md
new file mode 100644
index 0000000..ca77c82
--- /dev/null
+++
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/RunningSingleNodeCifar10PTJobs.md
@@ -0,0 +1,62 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# Tutorial: Running a standalone Cifar10 PyTorch Estimator Example.
+
+Currently, PyTorch integration with Submarine only supports PyTorch in
standalone (non-distributed) mode.
+Please also note that HDFS as a data source is not yet supported by PyTorch.
+
+## What is CIFAR-10?
+CIFAR-10 is a common benchmark in machine learning for image recognition.
The example below is based on the CIFAR-10 dataset.
+
+**Warning:**
+
+Please note that YARN service doesn't allow multiple services with the same
name, so please run the following command
+```
+yarn application -destroy <service-name>
+```
+to delete services if you want to reuse the same service name.
+
+## Prepare Docker images
+
+Refer to [Write Dockerfile](WriteDockerfilePT.html) to build a Docker image or
use a prebuilt one.
+
+## Running PyTorch jobs
+
+### Run standalone training
+
+```
+export
HADOOP_CLASSPATH="/home/systest/hadoop-submarine-score-yarnservice-runtime-0.2.0-SNAPSHOT.jar:/home/systest/hadoop-submarine-core-0.2.0-SNAPSHOT.jar"
+/opt/hadoop/bin/yarn jar
/home/systest/hadoop-submarine-core-0.2.0-SNAPSHOT.jar job run \
+--name pytorch-job-001 \
+--verbose \
+--framework pytorch \
+--wait_job_finish \
+--docker_image pytorch-latest-gpu:0.0.1 \
+--input_path hdfs://unused \
+--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre \
+--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.2 \
+--env YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true \
+--num_workers 1 \
+--worker_resources memory=5G,vcores=2 \
+--worker_launch_cmd "cd /test/ && python cifar10_tutorial.py"
+
+```
+
+For the meaning of the individual parameters, see the
[QuickStart](QuickStart.html) page!
+
+**Remarks:**
+Please note that the input path parameter is mandatory, but not yet used by
the PyTorch docker container.
\ No newline at end of file
diff --git
a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfilePT.md
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfilePT.md
new file mode 100644
index 0000000..84ca479
--- /dev/null
+++
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfilePT.md
@@ -0,0 +1,114 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# Creating Docker Images for Running PyTorch on YARN
+
+## How to create docker images to run PyTorch on YARN
+
+A Dockerfile to run PyTorch on YARN needs two parts:
+
+**Base libraries which PyTorch depends on**
+
+1) OS base image, for example ```ubuntu:16.04```
+
+2) PyTorch dependent libraries and packages. For example ```python```,
```scipy```. For GPU support, you also need ```cuda```, ```cudnn```, etc.
+
+3) PyTorch package.
+
+**Libraries to access HDFS**
+
+1) JDK
+
+2) Hadoop
+
+Here's an example of a base image (with GPU support) to install PyTorch:
+```
+FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
+ARG PYTHON_VERSION=3.6
+RUN apt-get update && apt-get install -y --no-install-recommends \
+ build-essential \
+ cmake \
+ git \
+ curl \
+ vim \
+ ca-certificates \
+ libjpeg-dev \
+ libpng-dev \
+ wget &&\
+ rm -rf /var/lib/apt/lists/*
+
+
+RUN curl -o ~/miniconda.sh -O
https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
+ chmod +x ~/miniconda.sh && \
+ ~/miniconda.sh -b -p /opt/conda && \
+ rm ~/miniconda.sh && \
+ /opt/conda/bin/conda install -y python=$PYTHON_VERSION numpy pyyaml scipy
ipython mkl mkl-include cython typing && \
+ /opt/conda/bin/conda install -y -c pytorch magma-cuda100 && \
+ /opt/conda/bin/conda clean -ya
+ENV PATH /opt/conda/bin:$PATH
+RUN pip install ninja
+# This must be done before pip so that requirements.txt is available
+WORKDIR /opt/pytorch
+RUN git clone https://github.com/pytorch/pytorch.git
+WORKDIR pytorch
+RUN git submodule update --init
+RUN TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX" TORCH_NVCC_FLAGS="-Xfatbin
-compress-all" \
+ CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
+ pip install -v .
+
+WORKDIR /opt/pytorch
+RUN git clone https://github.com/pytorch/vision.git && cd vision && pip
install -v .
+
+```
+
+On top of the above image, add files and install packages to access HDFS:
+```
+RUN apt-get update && apt-get install -y openjdk-8-jdk wget
+# Install hadoop
+ENV HADOOP_VERSION="3.1.2"
+RUN wget
http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
+RUN tar zxf hadoop-${HADOOP_VERSION}.tar.gz
+RUN ln -s hadoop-${HADOOP_VERSION} hadoop-current
+RUN rm hadoop-${HADOOP_VERSION}.tar.gz
+```
+
+Build and push to your own docker registry: Use ```docker build ... ``` and
```docker push ...``` to finish this step.
+
+## Use examples to build your own PyTorch docker images
+
+We provided some example Dockerfiles for you to build your own PyTorch docker
images.
+
+For latest PyTorch
+
+- *docker/pytorch/base/ubuntu-16.04/Dockerfile.gpu.pytorch_latest*: Latest
PyTorch that supports GPU, which is prebuilt to CUDA10.
+-
*docker/pytorch/with-cifar10-models/ubuntu-16.04/Dockerfile.gpu.pytorch_latest*:
Latest PyTorch that supports GPU, which is prebuilt to CUDA10, with models.
+
+## Build Docker images
+
+### Manually build Docker image:
+
+Under `docker/pytorch` directory, run `build-all.sh` to build all Docker
images. This command will build the following Docker images:
+
+- `pytorch-latest-gpu-base:0.0.1` for base Docker image which includes Hadoop,
PyTorch, GPU base libraries.
+- `pytorch-latest-gpu:0.0.1` which includes cifar10 model as well
+
+### Use prebuilt images
+
+(No liability)
+You can also use prebuilt images for convenience:
+
+- hadoopsubmarine/pytorch-latest-gpu-base:0.0.1
diff --git
a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfile.md
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfileTF.md
similarity index 87%
rename from
hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfile.md
rename to
hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfileTF.md
index 0d4c6c1..5dc565d 100644
---
a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfile.md
+++
b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfileTF.md
@@ -98,10 +98,10 @@ We provided following examples for you to build tensorflow
docker images.
For Tensorflow 1.13.1 (Precompiled to CUDA 10.x)
-- *docker/base/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1
supports CPU only.
-- *docker/with-cifar10-models/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*:
Tensorflow 1.13.1 supports CPU only, and included models
-- *docker/base/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1
supports GPU, which is prebuilt to CUDA10.
-- *docker/with-cifar10-models/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*:
Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10, with models.
+- *docker/tensorflow/base/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow
1.13.1 supports CPU only.
+-
*docker/tensorflow/with-cifar10-models/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*:
Tensorflow 1.13.1 supports CPU only, and included models
+- *docker/tensorflow/base/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow
1.13.1 supports GPU, which is prebuilt to CUDA10.
+-
*docker/tensorflow/with-cifar10-models/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*:
Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10, with models.
## Build Docker images
---------------------------------------------------------------------
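The standalone submission documented in RunningSingleNodeCifar10PTJobs.md in this commit can also be assembled programmatically. The following is a minimal sketch, not part of the commit: `build_pytorch_job_cmd` is a hypothetical helper, and the paths, image name and environment values are simply copied from the tutorial's example and would need adjusting for a real cluster. It only builds and prints the command line; it does not invoke `yarn`.

```python
import shlex

# Values copied from the example in RunningSingleNodeCifar10PTJobs.md;
# they are placeholders for illustration, not required locations.
YARN_BIN = "/opt/hadoop/bin/yarn"
SUBMARINE_JAR = "/home/systest/hadoop-submarine-core-0.2.0-SNAPSHOT.jar"


def build_pytorch_job_cmd(name, docker_image, launch_cmd,
                          input_path="hdfs://unused",
                          worker_resources="memory=5G,vcores=2",
                          env=None):
    """Return the argv list for a standalone PyTorch `job run` submission.

    Hypothetical helper for illustration; it is not part of Submarine.
    """
    if not input_path:
        # The tutorial notes that --input_path is mandatory even though the
        # PyTorch container does not use it yet.
        raise ValueError("--input_path is mandatory")
    argv = [
        YARN_BIN, "jar", SUBMARINE_JAR, "job", "run",
        "--name", name,
        "--verbose",
        "--framework", "pytorch",
        "--wait_job_finish",
        "--docker_image", docker_image,
        "--input_path", input_path,
    ]
    # Each environment variable becomes its own --env KEY=VALUE pair.
    for key, value in (env or {}).items():
        argv += ["--env", f"{key}={value}"]
    argv += [
        "--num_workers", "1",  # only standalone mode is documented
        "--worker_resources", worker_resources,
        "--worker_launch_cmd", launch_cmd,
    ]
    return argv


if __name__ == "__main__":
    cmd = build_pytorch_job_cmd(
        name="pytorch-job-001",
        docker_image="pytorch-latest-gpu:0.0.1",
        launch_cmd="cd /test/ && python cifar10_tutorial.py",
        env={
            "DOCKER_JAVA_HOME": "/usr/lib/jvm/java-8-openjdk-amd64/jre",
            "DOCKER_HADOOP_HDFS_HOME": "/hadoop-3.1.2",
            "YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL": "true",
        },
    )
    # Print a shell-quoted command line instead of executing it.
    print(shlex.join(cmd))
```

Keeping the arguments as a list avoids quoting mistakes in `--worker_launch_cmd`, which contains spaces and `&&`. As in the tutorial, `HADOOP_CLASSPATH` still has to be exported before the printed command is run, and an existing service with the same name has to be removed first with `yarn application -destroy <service-name>`.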