This is an automated email from the ASF dual-hosted git repository.
ztang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/submarine.git
The following commit(s) were added to refs/heads/master by this push:
new d8b1b38 SUBMARINE-519. [WIP] refactor of submarine documentation
d8b1b38 is described below
commit d8b1b38fcaa331cfb2aaf79b9d26012137046b92
Author: Wangda Tan <[email protected]>
AuthorDate: Mon Jun 1 18:01:31 2020 -0700
SUBMARINE-519. [WIP] refactor of submarine documentation
### What is this PR for?
A few sentences describing the overall goals of the pull request's commits.
First time? Check out the contributing guide -
https://submarine.apache.org/contribution/contributions.html
### What type of PR is it?
[Documentation]
### Todos
* [ ] - Task
### What is the Jira issue?
* https://issues.apache.org/jira/projects/SUBMARINE/issues/SUBMARINE-519
### How should this be tested?
* First time? Setup Travis CI as described on
https://submarine.apache.org/contribution/contributions.html#continuous-integration
* Strongly recommended: add automated unit tests for any new or changed
behavior
* Outline any manual steps to test the PR here.
### Screenshots (if appropriate)
### Questions:
* Do the license files need an update? No
* Are there breaking changes for older versions? No
* Does this need documentation? Yes
Author: Wangda Tan <[email protected]>
Closes #303 from wangdatan/SUBMARINE-519 and squashes the following commits:
c4ca4d9 [Wangda Tan] changes ..
f268fba [Wangda Tan] Refactored structure of the doc, and removed
deprecated features such as yarn service runtime
b275c4d [Wangda Tan] Update and rename README.md to user-guide.md
b571724 [Wangda Tan] Create development-guide.md
aafc786 [Wangda Tan] Update README.md for examples and roadmap.
f0e30d8 [Wangda Tan] Update Readme.md, WIP
---
README.md | 99 ++--
dev-support/submarine-installer/README-CN.md | 2 +-
dev-support/submarine-installer/README.md | 2 +-
docs/README.md | 91 ---
docs/development-guide-home.md | 28 +
.../kaldi/RunningDistributedThchs30KaldiJobs.md | 16 +-
docs/helper/InstallationGuide.md | 549 -----------------
docs/helper/InstallationGuideChineseVersion.md | 648 ---------------------
docs/helper/MybatisGenerator.md | 59 --
docs/helper/QuickStart.md | 280 ---------
...nningDistributedCifar10TFJobsWithYarnService.md | 206 -------
...unningSingleNodeCifar10PTJobsWithYarnService.md | 72 ---
docs/user-guide-home.md | 31 +
docs/userdocs/k8s/README.md | 41 ++
.../README.md => userdocs/k8s/deploy-submarine.md} | 34 +-
.../k8s/run-tensorflow-on-k8s.md} | 6 +-
.../k8s}/setup-kubernetes.md | 76 ---
.../ml-frameworks => userdocs/k8s}/tensorflow.md | 0
docs/userdocs/yarn/Dockerfiles.md | 22 +
docs/userdocs/yarn/README.md | 37 ++
.../yarn}/TestAndTroubleshooting.md | 0
.../{helper => userdocs/yarn}/WriteDockerfileMX.md | 0
.../{helper => userdocs/yarn}/WriteDockerfilePT.md | 0
.../{helper => userdocs/yarn}/WriteDockerfileTF.md | 0
.../yarn/YARNRuntimeGuide.md} | 16 +-
.../base/ubuntu-18.04/Dockerfile.cpu.mx_latest | 0
.../base/ubuntu-18.04/Dockerfile.gpu.mx_latest | 0
.../yarn}/docker/mxnet/build-all.sh | 0
.../mxnet/cifar10/Dockerfile.cifar10.mx_1.5.1 | 0
.../ubuntu-18.04/Dockerfile.gpu.pytorch_latest | 0
.../yarn}/docker/pytorch/build-all.sh | 0
.../with-cifar10-models/cifar10_tutorial.py | 0
.../ubuntu-18.04/Dockerfile.gpu.pytorch_latest | 0
.../base/ubuntu-18.04/Dockerfile.cpu.tf_1.13.1 | 0
.../base/ubuntu-18.04/Dockerfile.gpu.tf_1.13.1 | 0
.../yarn}/docker/tensorflow/build-all.sh | 0
.../mnist/Dockerfile.tony.tf.mnist.tf_1.13.1 | 0
.../ubuntu-18.04/Dockerfile.cpu.tf_1.13.1 | 0
.../ubuntu-18.04/Dockerfile.gpu.tf_1.13.1 | 0
.../cifar10_estimator_tf_1.13.1/README.md | 0
.../cifar10_estimator_tf_1.13.1/cifar10.py | 0
.../cifar10_estimator_tf_1.13.1/cifar10_main.py | 0
.../cifar10_estimator_tf_1.13.1/cifar10_model.py | 0
.../cifar10_estimator_tf_1.13.1/cifar10_utils.py | 0
.../generate_cifar10_tfrecords.py | 0
.../cifar10_estimator_tf_1.13.1/model_base.py | 0
.../zeppelin-notebook-example/Dockerfile.gpu | 0
.../zeppelin-notebook-example/run_container.sh | 0
.../tensorflow/zeppelin-notebook-example/shiro.ini | 0
.../zeppelin-notebook-example/zeppelin-site.xml | 0
50 files changed, 263 insertions(+), 2052 deletions(-)
diff --git a/README.md b/README.md
index 5c78c7f..7580f09 100644
--- a/README.md
+++ b/README.md
@@ -20,92 +20,87 @@
# What is Apache Submarine?
-Apache Submarine is a unified AI platform which allows engineers and data
scientists to run Machine Learning and Deep Learning workload in distributed
cluster.
+Apache Submarine (Submarine for short) is the `ONE PLATFORM` that allows Data Scientists to create end-to-end machine learning workflows. `ONE PLATFORM` means it supports Data Scientists in finishing their jobs on the same platform without frequently switching their toolsets: from dataset exploration, data pipeline creation, and model training (experiments), to pushing models to production (model serving and monitoring). All these steps can be completed within the `ONE PLATFORM`.
-Goals of Submarine:
-- It allows jobs easy access data/models in HDFS and other storages.
-- Can launch services to serve TensorFlow/PyTorch/MXNet models.
-- Support run distributed TensorFlow jobs with simple configs.
-- Support run user-specified Docker images.
-- Support specify GPU and other resources.
-- Support launch TensorBoard for training jobs if user specified.
-- Support customized DNS name for roles (like TensorBoard.$user.$domain:6006)
+## Why Submarine?
-# Architecture
+There are already many open-source and commercial projects trying to create an end-to-end machine-learning/deep-learning platform, so what's the vision of Submarine?
-
+### Problems
-## Components
+1) Existing products lack a good User-Interface (API, SDK, etc.) for running training workloads at scale that is repeatable and easy for data scientists to understand, whether on cloud or on premise.
+2) Data Scientists want to focus on domain-specific targets (e.g. improving the Click-Through-Rate), but available products usually just hand them a platform (an SDK to run a distributed PyTorch script).
+3) Many products provide functionalities for data exploration, model training, and serving/monitoring. However, these functionalities are largely disconnected from each other and cannot work together organically.
-### Submarine Workbench
+_Theodore Levitt_ once said:
-Submarine Workbench is a WEB system. Algorithm engineers can perform complete
lifecycle management of machine learning jobs in the Workbench.
+```
+“People don’t want to buy a quarter-inch drill. They want a quarter-inch hole.”
+```
-+ **Projects**
+### Goals of Submarine
- Manage machine learning jobs through project.
+#### Model Training (Experiment)
-+ **Data**
+- Can run experiments (training jobs) on premise or on cloud, via easy-to-use User-Interfaces.
+- Easy for Data Scientists (DS) to manage training code and dependencies (Docker, Python dependencies, etc.).
+- ML-focused APIs to run/track experiments from the Python SDK (notebook), REST API, and CLI.
+- Provide APIs to run training jobs using popular frameworks (Standalone/Distributed TensorFlow/PyTorch/Horovod).
+- Pre-packaged Training Templates for Data Scientists to focus on domain-specific tasks (like using DeepFM to build a CTR prediction model).
+- Support GPUs and other compute speed-up devices.
+- Support running on K8s/YARN or other resource management systems.
+- Pipelines are also on the backlog; we will look into pipelines for training in the future.
- Data processing, data conversion, feature engineering, etc. in the workbench.
+#### Notebook Service
-+ **Job**
+- Submarine aims to provide a notebook service, which allows users to create/edit/delete a notebook instance (such as a Jupyter notebook) running on the cluster.
+- Users can submit experiments and manage models using the Submarine SDK.
- Data processing, algorithm development, and model training in machine
learning jobs as a job run.
+#### Model Management (Serving/versioning/monitoring, etc.)
-+ **Model**
+- Model management for model-serving/versioning/monitoring is on the roadmap.
- Algorithm selection, parameter adjustment, model training, model release,
model Serving.
+## Easy-to-use User-Interface of Submarine
-+ **Workflow**
+As mentioned above, Submarine aims to bring Data-Scientist-friendly user-interfaces to make their lives easier. Here are some examples of Submarine user-interfaces.
- Automate the complete life cycle of machine learning operations by
scheduling workflows for data processing, model training, and model publishing.
+<FIXME: Add/FIX more contents below>
-+ **Team**
+<WIP>
- Support team development, code sharing, comments, code and model version
management.
+### Submit a distributed Tensorflow experiment via Submarine Python SDK
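This section is still marked WIP in the diff, so as a purely illustrative, self-contained sketch (the names `ExperimentSpec` and `submit_experiment` are hypothetical stand-ins, NOT the real Submarine Python SDK API), submitting a distributed TensorFlow experiment could be modeled roughly like this:

```python
from dataclasses import dataclass, field

# Hypothetical experiment spec -- a stand-in, NOT the real Submarine SDK API.
@dataclass
class ExperimentSpec:
    name: str
    framework: str             # e.g. "tensorflow"
    docker_image: str
    entrypoint: str
    num_workers: int = 1       # >1 means distributed training
    num_ps: int = 0            # parameter servers (often CPU-only)
    resources: dict = field(default_factory=dict)

def submit_experiment(spec: ExperimentSpec) -> str:
    """Stand-in for an SDK submit call; returns a fake experiment id."""
    assert spec.num_workers >= 1, "need at least one worker"
    return f"experiment-{spec.name}-{spec.num_workers}w"

spec = ExperimentSpec(
    name="mnist-dist",
    framework="tensorflow",
    docker_image="tensorflow/tensorflow:1.13.1",
    entrypoint="python /opt/mnist.py",
    num_workers=2,
    num_ps=1,
    resources={"worker": "memory=4G,vcores=2,gpu=1"},
)
print(submit_experiment(spec))  # experiment-mnist-dist-2w
```

The real SDK call shapes will be filled in as this section is completed; the point here is only the kind of information (image, entrypoint, worker/ps counts, resources) a distributed submission needs to carry.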
-### Submarine Core
+### Submit a pre-defined experiment template job
-The submarine core is the execution engine of the system and has the following
features:
+### Submit an experiment via Submarine UI
-- **ML Engine**
+(Available on 0.6.0, see Roadmap)
- Support for multiple machine learning framework access, such as tensorflow,
pytorch, mxnet.
+## Architecture, Design and Requirements
-- **Data Engine**
+If you want to know more about Submarine's architecture, components, requirements and design docs, they can be found at [Architecture-and-requirement](docs/design/architecture-and-requirements.md)
- Docking the externally deployed Spark calculation engine for data processing.
+Detailed design documentation, implementation notes can be found at:
[Implementation notes](docs/design/implementation-notes.md)
-- **SDK**
-
- Support Python, Scala, R language for algorithm development, The SDK is
provided to help developers use submarine's internal data caching, data
exchange, and task tracking to more efficiently improve the development and
execution of machine learning tasks.
-
-- **Submitter**
-
- Compatible with the underlying hybrid scheduling system of yarn and k8s for
unified task scheduling and resource management, so that users are not aware.
-
-+ **Hybrid Scheduler**
- + **YARN**
- + **Kubernetes**
+## Apache Submarine Community
-## Quick start
+Read the [Apache Submarine Community Guide](./docs/community/README.md)
-### Run mini-submarine in one step
+How to contribute [Contributing Guide](./docs/community/contributing.md)
-You can use [mini-submarine](./dev-support/mini-submarine/README.md) for a quick experience of Submarine.
+Issue Tracking: https://issues.apache.org/jira/projects/SUBMARINE
-This is a docker image built for submarine development and quick start test.
+## User Document
-### Installation and deployment
+See [User Guide Home Page](docs/user-guide-home.md)
-Read the [Quick Start Guide](./docs/helper/QuickStart.md)
+## Developer Document
-## Apache Submarine Community
+See [Developer Guide Home Page](docs/development-guide-home.md)
-Read the [Apache Submarine Community Guide](./docs/community/README.md)
+## Roadmap
-How to contribute [Contributing Guide](./docs/community/contributing.md)
+Want to know more about what's coming in Submarine? Please check out the roadmap: https://cwiki.apache.org/confluence/display/SUBMARINE/Roadmap
## License
diff --git a/dev-support/submarine-installer/README-CN.md
b/dev-support/submarine-installer/README-CN.md
index 6527ae8..ad0c02e 100644
--- a/dev-support/submarine-installer/README-CN.md
+++ b/dev-support/submarine-installer/README-CN.md
@@ -8,7 +8,7 @@ hadoop 在 2.9 版本中就已经让 YARN 支持了 Docker 容器的资源调度
由于分布式深度学习框架需要运行在多个 Docker
的容器之中,并且需要能够让运行在容器之中的各个服务相互协调,完成分布式机器学习的模型训练和模型发布等服务,这其中就会牵涉到 `DNS`、`Docker` 、
`GPU`、`Network`、`显卡`、`操作系统内核` 修改等多个系统工程问题,正确的部署好 **Hadoop {Submarine}**
的运行环境是一件很困难和耗时的事情。
-为了降低 hadoop 2.9 以上版本的 docker 等组件的部署难度,所以我们专门开发了这个用来部署 `Submarine` 运行时环境的
`submarine-installer`
项目,提供一键安装脚本,也可以分步执行安装、卸载、启动和停止各个组件,同时讲解每一步主要参数配置和注意事项。我们同时提供了
[中文手册](../../docs/helper/InstallationGuideChineseVersion.md) 和
[英文手册](../../docs/helper/InstallationGuide.md) ,帮助用户更容易的部署,发现问题也可以及时解决。
+为了降低 hadoop 2.9 以上版本的 docker 等组件的部署难度,所以我们专门开发了这个用来部署 `Submarine` 运行时环境的
`submarine-installer`
项目,提供一键安装脚本,也可以分步执行安装、卸载、启动和停止各个组件,同时讲解每一步主要参数配置和注意事项。我们同时提供了
[中文手册](project/github/submarine/docs/userdocs/yarn/InstallationGuideChineseVersion.md)
和 [英文手册](project/github/submarine/docs/userdocs/yarn/InstallationGuide.md)
,帮助用户更容易的部署,发现问题也可以及时解决。
## 先决条件
diff --git a/dev-support/submarine-installer/README.md
b/dev-support/submarine-installer/README.md
index c7ce650..c6af549 100644
--- a/dev-support/submarine-installer/README.md
+++ b/dev-support/submarine-installer/README.md
@@ -9,7 +9,7 @@ Hadoop has enabled YARN to support Docker container since 2.x.
**Submarine** the
Since the distributed deep learning framework needs to run in multiple Docker
containers and needs to be able to coordinate the various services running in
the container, complete the services of model training and model publishing for
distributed machine learning. Involving multiple system engineering problems
such as `DNS`, `Docker`, `GPU`, `Network`, `graphics card`, `operating system
kernel` modification, etc. It is very difficult and time-consuming to properly
deploy the **Submarine [...]
-In order to reduce the difficulty of deploying components, we have developed
this **submarine-installer** project to deploy the **Submarine** runtime
environment, providing a one-click installation script or step-by-step
installation. Unload, start, and stop individual components, and explain the
main parameter configuration and considerations for each step. We also provides
a [Chinese manual](../../docs/helper/InstallationGuideChineseVersion.md) and an
[English manual](../../docs/helper [...]
+In order to reduce the difficulty of deploying components, we have developed this **submarine-installer** project to deploy the **Submarine** runtime environment. It provides a one-click installation script, or step-by-step installation, uninstallation, starting, and stopping of individual components, and explains the main parameter configuration and considerations for each step. We also provide a [Chinese manual](project/github/submarine/docs/userdocs/yarn/InstallationGuideChineseVersion.md) and an [English [...]
This installer is just created for your convenience and for test purposes only. You can choose to install the required libraries by yourself; please don't run this script in your production environment before fully validating it in a sandbox environment.
diff --git a/docs/README.md b/docs/README.md
deleted file mode 100644
index 299bd7b..0000000
--- a/docs/README.md
+++ /dev/null
@@ -1,91 +0,0 @@
-<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
- http://www.apache.org/licenses/LICENSE-2.0
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
--->
-# Docs index
-
-Click below contents if you want to understand more.
-
-## Quick Start Guide
-
-[Quick Start Guide](./helper/QuickStart.md)
-
-## Build From Code
-
-[Build From Code Guide](./development/BuildFromCode.md)
-
-## Apache Submarine Community
-
-[Apache Submarine Community Guide](./community/README.md)
-
-## Submarine Workbench
-
-[Submarine Workbench Guide](./workbench/README.md)
-
-## Submarine Server
-[Submarine Server Guide](./submarine-server/README.md)
-
-## Examples
-
-Here're some examples about Submarine usage.
-
-[Running Distributed CIFAR 10 Tensorflow
Job_With_Yarn_Service_Runtime](helper/RunningDistributedCifar10TFJobsWithYarnService.md)
-
-[Running Standalone CIFAR 10 PyTorch
Job_With_Yarn_Service_Runtime](helper/RunningSingleNodeCifar10PTJobsWithYarnService.md)
-
-[Running Distributed thchs30 Kaldi
Job](./ecosystem/kaldi/RunningDistributedThchs30KaldiJobs.md)
-
-## Development
-
-[Submarine Project Development Guide](./development/README.md)
-
-[Submarine Project Database Guide](./database/README.md)
-
-## Dockerfile
-
-[How to write Dockerfile for Submarine TensorFlow
jobs](./helper/WriteDockerfileTF.md)
-
-[How to write Dockerfile for Submarine PyTorch
jobs](./helper/WriteDockerfilePT.md)
-
-[How to write Dockerfile for Submarine MXNet
jobs](./helper/WriteDockerfileMX.md)
-
-[How to write Dockerfile for Submarine Kaldi
jobs](./ecosystem/kaldi/WriteDockerfileKaldi.md)
-
-## Install Dependencies
-
-**Note: You need to install dependencies when using Hadoop YARN 3.1.x + or
above.**
-
-Submarine project may use YARN Service (when the Submarine YARN service runtime is being used, see [QuickStart](./helper/QuickStart.md)), Docker containers, and GPUs (when GPU hardware is available and properly configured).
-
-That means as an admin you may have to properly setup YARN Service related
dependencies, including:
-
-- YARN Registry DNS
-
-Docker related dependencies, including:
-
-- Docker binary with expected versions.
-Docker network which allows Docker containers to talk to each other across different nodes.
-
-And when GPU plans to be used:
-
-- GPU Driver.
-- Nvidia-docker.
-
-For your convenience, we provide installation documents to help you to set up
your environment. You can always choose to have them installed in your own way.
-
-Use Submarine installer to install dependencies:
[EN](../dev-support/submarine-installer/README.md)
[CN](../dev-support/submarine-installer/README-CN.md)
-
-Alternatively, you can manually install dependencies by following: [EN](./helper/InstallationGuide.md) [CN](./helper/InstallationGuideChineseVersion.md)
-
-Once you have installed the dependencies, please follow this guide to [TestAndTroubleshooting](./helper/TestAndTroubleshooting.md).
-
diff --git a/docs/development-guide-home.md b/docs/development-guide-home.md
new file mode 100644
index 0000000..4972319
--- /dev/null
+++ b/docs/development-guide-home.md
@@ -0,0 +1,28 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# Development Guide
+
+## Work-in-progress components
+
+### Submarine Workbench
+
+[Submarine Workbench Guide](./workbench/README.md)
+
+## Development
+
+[Submarine Project Development Guide](./development/README.md)
+
+[Submarine Project Database Guide](./database/README.md)
diff --git a/docs/ecosystem/kaldi/RunningDistributedThchs30KaldiJobs.md
b/docs/ecosystem/kaldi/RunningDistributedThchs30KaldiJobs.md
index cafe371..2eb32b4 100644
--- a/docs/ecosystem/kaldi/RunningDistributedThchs30KaldiJobs.md
+++ b/docs/ecosystem/kaldi/RunningDistributedThchs30KaldiJobs.md
@@ -100,7 +100,7 @@ Explanations:
- `>1` num_workers indicates it is a distributed training.
- Parameters / resources / Docker image of the parameter server can be specified separately. In many cases, the parameter server doesn't require a GPU. We don't need a parameter server here.
-For the meaning of the individual parameters, see the
[QuickStart](../../helper/QuickStart.md) page!
+For the meaning of the individual parameters, see the
[QuickStart](project/github/submarine/docs/userdocs/yarn/QuickStart.md) page!
*Outputs of distributed training*
@@ -170,7 +170,7 @@ Get:32 http://archive.ubuntu.com/ubuntu xenial-updates/main
amd64 python3-reques
Get:33 http://archive.ubuntu.com/ubuntu xenial/main amd64 ssh-import-id all
5.5-0ubuntu1 [10.2 kB]
Fetched 12.1 MB in 0s (15.0 MB/s)
Selecting previously unselected package libatm1:amd64.
-(Reading database ...
+(Reading database ...
(Reading database ... 5%
(Reading database ... 10%
(Reading database ... 15%
@@ -319,7 +319,7 @@ setting alias database
changing /etc/mailname to master-0.XXX
setting myorigin
setting destinations: $myhostname, master-0.XXX, localhost.XXX, , localhost
-setting relayhost:
+setting relayhost:
setting mynetworks: 127.0.0.0/8 [::ffff:127.0.0.0]/104 [::1]/128
setting mailbox_size_limit: 0
setting recipient_delimiter: +
@@ -328,7 +328,7 @@ setting inet_protocols: all
/etc/aliases does not exist, creating it.
WARNING: /etc/aliases exists, but does not have a root alias.
-Postfix is now set up with a default configuration. If you need to make
+Postfix is now set up with a default configuration. If you need to make
changes, edit
/etc/postfix/main.cf (and others) as needed. To view Postfix configuration
values, see postconf(1).
@@ -462,7 +462,7 @@ Get:31 http://archive.ubuntu.com/ubuntu xenial-updates/main
amd64 python3-reques
Get:32 http://archive.ubuntu.com/ubuntu xenial/main amd64 ssh-import-id all
5.5-0ubuntu1 [10.2 kB]
Fetched 9633 kB in 2s (4496 kB/s)
Selecting previously unselected package libatm1:amd64.
-(Reading database ...
+(Reading database ...
(Reading database ... 5%
(Reading database ... 10%
(Reading database ... 15%
@@ -608,7 +608,7 @@ setting alias database
changing /etc/mailname to worker-0.XXX
setting myorigin
setting destinations: $myhostname, worker-0.XXX, localhost.XXX, , localhost
-setting relayhost:
+setting relayhost:
setting mynetworks: 127.0.0.0/8 [::ffff:127.0.0.0]/104 [::1]/128
setting mailbox_size_limit: 0
setting recipient_delimiter: +
@@ -617,7 +617,7 @@ setting inet_protocols: all
/etc/aliases does not exist, creating it.
WARNING: /etc/aliases exists, but does not have a root alias.
-Postfix is now set up with a default configuration. If you need to make
+Postfix is now set up with a default configuration. If you need to make
changes, edit
/etc/postfix/main.cf (and others) as needed. To view Postfix configuration
values, see postconf(1).
@@ -675,4 +675,4 @@ done for worker-0.XXX worker.
Sample output of sge:

-
\ No newline at end of file
+
diff --git a/docs/helper/InstallationGuide.md b/docs/helper/InstallationGuide.md
deleted file mode 100644
index e9ea3a5..0000000
--- a/docs/helper/InstallationGuide.md
+++ /dev/null
@@ -1,549 +0,0 @@
-# Submarine Installation Guide
-
-## Prerequisites
-
-(Please note that all following prerequisites are just an example for you to
install. You can always choose to install your own version of kernel, different
users, different drivers, etc.).
-
-### Operating System
-
-The operating system and kernel versions we have tested are shown in the following table; these are the recommended minimum required versions.
-
-| Environment | Version |
-| ------ | ------ |
-| Operating System | centos-release-7-3.1611.el7.centos.x86_64 |
-| Kernel | 3.10.0-514.el7.x86_64 |
-
-### User & Group
-
-Some specific users and groups are recommended for installing hadoop/docker. Please create them if they are missing.
-
-```
-adduser hdfs
-adduser mapred
-adduser yarn
-addgroup hadoop
-usermod -aG hdfs,hadoop hdfs
-usermod -aG mapred,hadoop mapred
-usermod -aG yarn,hadoop yarn
-usermod -aG hdfs,hadoop hadoop
-groupadd docker
-usermod -aG docker yarn
-usermod -aG docker hadoop
-```
-
-### GCC Version
-
-Check the version of GCC tool (to compile kernel).
-
-```bash
-gcc --version
-gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
-# install if needed
-yum install gcc make g++
-```
-
-### Kernel header & Kernel devel
-
-```bash
-# Approach 1:
-yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
-# Approach 2:
-wget
http://vault.centos.org/7.3.1611/os/x86_64/Packages/kernel-headers-3.10.0-514.el7.x86_64.rpm
-rpm -ivh kernel-headers-3.10.0-514.el7.x86_64.rpm
-```
-
-### GPU Servers (Only for Nvidia GPU equipped nodes)
-
-```
-lspci | grep -i nvidia
-
-# If the server has gpus, you can get info like this:
-04:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1)
-82:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1)
-```
-
-
-
-### Nvidia Driver Installation (Only for Nvidia GPU equipped nodes)
-
-If you need to upgrade GPU drivers, make a clean installation: if an nvidia driver/cuda has been installed before, it should be uninstalled first.
-
-```
-# uninstall cuda:
-sudo /usr/local/cuda-10.0/bin/uninstall_cuda_10.0.pl
-
-# uninstall nvidia-driver:
-sudo /usr/bin/nvidia-uninstall
-```
-
-To check GPU version, install nvidia-detect
-
-```
-yum install nvidia-detect
-# run 'nvidia-detect -v' to get reqired nvidia driver version:
-nvidia-detect -v
-Probing for supported NVIDIA devices...
-[10de:13bb] NVIDIA Corporation GM107GL [Quadro K620]
-This device requires the current xyz.nm NVIDIA driver kmod-nvidia
-[8086:1912] Intel Corporation HD Graphics 530
-An Intel display controller was also detected
-```
-
-Pay attention to `This device requires the current xyz.nm NVIDIA driver
kmod-nvidia`.
-Download the installer like
[NVIDIA-Linux-x86_64-390.87.run](https://www.nvidia.com/object/linux-amd64-display-archive.html).
-
-
-Some preparatory work for nvidia driver installation. (This follows the normal Nvidia GPU driver installation; it is just put here for your convenience.)
-
-```
-# It may take a while to update
-yum -y update
-yum -y install kernel-devel
-
-yum -y install epel-release
-yum -y install dkms
-
-# Disable nouveau
-vim /etc/default/grub
-# Add the following configuration in “GRUB_CMDLINE_LINUX” part
-rd.driver.blacklist=nouveau nouveau.modeset=0
-
-# Generate configuration
-grub2-mkconfig -o /boot/grub2/grub.cfg
-
-vim /etc/modprobe.d/blacklist.conf
-# Add confiuration:
-blacklist nouveau
-
-mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
-dracut /boot/initramfs-$(uname -r).img $(uname -r)
-reboot
-```
-
-Check whether nouveau is disabled
-
-```
-lsmod | grep nouveau # return null
-
-# install nvidia driver
-sh NVIDIA-Linux-x86_64-390.87.run
-```
-
-Some options during the installation
-
-```
-Install NVIDIA's 32-bit compatibility libraries (Yes)
-centos Install NVIDIA's 32-bit compatibility libraries (Yes)
-Would you like to run the nvidia-xconfig utility to automatically update your
X configuration file... (NO)
-```
-
-
-Check nvidia driver installation
-
-```
-nvidia-smi
-```
-
-Reference:
-https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
-
-
-
-### Docker Installation
-
-We recommend using Docker version >= 1.12.5; the following steps are just for your reference. You can always choose other approaches to install Docker.
-
-```
-yum -y update
-yum -y install yum-utils
-yum-config-manager --add-repo https://yum.dockerproject.org/repo/main/centos/7
-yum -y update
-
-# Show available packages
-yum search --showduplicates docker-engine
-
-# Install docker 1.12.5
-yum -y --nogpgcheck install docker-engine-1.12.5*
-systemctl start docker
-
-chown hadoop:netease /var/run/docker.sock
-chown hadoop:netease /usr/bin/docker
-```
-
-Reference: https://docs.docker.com/cs-engine/1.12/
-
-### Docker Configuration
-
-Add a file, named daemon.json, under the path of /etc/docker/. Please replace
the variables of image_registry_ip, etcd_host_ip, localhost_ip,
yarn_dns_registry_host_ip, dns_host_ip with specific ips according to your
environments.
-
-```
-{
- "insecure-registries": ["${image_registry_ip}:5000"],
-
"cluster-store":"etcd://${etcd_host_ip1}:2379,${etcd_host_ip2}:2379,${etcd_host_ip3}:2379",
- "cluster-advertise":"${localhost_ip}:2375",
- "dns": ["${yarn_dns_registry_host_ip}", "${dns_host_ip1}"],
- "hosts": ["tcp://${localhost_ip}:2375", "unix:///var/run/docker.sock"]
-}
-```
-
-Restart docker daemon:
-
-```
-sudo systemctl restart docker
-```
-
-
-
-### Docker EE version
-
-```bash
-$ docker version
-
-Client:
- Version: 1.12.5
- API version: 1.24
- Go version: go1.6.4
- Git commit: 7392c3b
- Built: Fri Dec 16 02:23:59 2016
- OS/Arch: linux/amd64
-
-Server:
- Version: 1.12.5
- API version: 1.24
- Go version: go1.6.4
- Git commit: 7392c3b
- Built: Fri Dec 16 02:23:59 2016
- OS/Arch: linux/amd64
-```
-
-### Nvidia-docker Installation (Only for Nvidia GPU equipped nodes)
-
-Submarine depends on nvidia-docker version 1.0
-
-```
-wget -P /tmp
https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm
-sudo rpm -i /tmp/nvidia-docker*.rpm
-# Start nvidia-docker
-sudo systemctl start nvidia-docker
-
-# Check nvidia-docker status:
-systemctl status nvidia-docker
-
-# Check nvidia-docker log:
-journalctl -u nvidia-docker
-
-# Test nvidia-docker-plugin
-curl http://localhost:3476/v1.0/docker/cli
-```
-
-According to `nvidia-driver` version, add folders under the path of
`/var/lib/nvidia-docker/volumes/nvidia_driver/`
-
-```
-mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87
-# 390.87 is the nvidia driver version
-
-mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
-mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
-
-cp /usr/bin/nvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
-cp /usr/lib64/libcuda*
/var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
-cp /usr/lib64/libnvidia*
/var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
-
-# Test with nvidia-smi
-nvidia-docker run --rm nvidia/cuda:9.0-devel nvidia-smi
-```
-
-Test docker, nvidia-docker, nvidia-driver installation
-
-```
-# Test 1
-nvidia-docker run --rm nvidia/cuda nvidia-smi
-```
-
-```
-# Test 2
-nvidia-docker run -it tensorflow/tensorflow:1.9.0-gpu bash
-# In docker container
-python
-import tensorflow as tf
-tf.test.is_gpu_available()
-```
-
-[The way to uninstall nvidia-docker
1.0](https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0))
-
-Reference:
-https://github.com/NVIDIA/nvidia-docker/tree/1.0
-
-
-### Tensorflow Image
-
-How to build a Tensorflow Image, please refer to
[WriteDockerfileTF.md](WriteDockerfileTF.md)
-
-
-### Test tensorflow in a docker container
-
-After the docker image is built, we can check the Tensorflow environment before submitting a yarn job.
-
-```shell
-$ docker run -it ${docker_image_name} /bin/bash
-# >>> In the docker container
-$ python
->>> import tensorflow as tf
->>> tf.__version__
-```
-
-If there are some errors, we could check the following configuration.
-
-1. LD_LIBRARY_PATH environment variable
-
- ```
- echo $LD_LIBRARY_PATH
-
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
- ```
-
-2. The location of libcuda.so.1, libcuda.so
-
- ```
- ls -l /usr/local/nvidia/lib64 | grep libcuda.so
- ```
-
-
-## Hadoop Installation
-
-### Get Hadoop Release
-You can either get a Hadoop release binary or compile it from source code. Please follow the guides from [Hadoop Homepage](https://hadoop.apache.org/).
-For hadoop cluster setup, please refer to [Hadoop Cluster
Setup](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html)
-
-
-### Start yarn services
-
-```
-YARN_LOGFILE=resourcemanager.log ./sbin/yarn-daemon.sh start resourcemanager
-YARN_LOGFILE=nodemanager.log ./sbin/yarn-daemon.sh start nodemanager
-YARN_LOGFILE=timeline.log ./sbin/yarn-daemon.sh start timelineserver
-YARN_LOGFILE=mr-historyserver.log ./sbin/mr-jobhistory-daemon.sh start
historyserver
-```
-
-### Test with a MR wordcount job
-
-```
-./bin/hadoop jar
/home/hadoop/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar
wordcount /tmp/wordcount.txt /tmp/wordcount-output4
-```
-
-## Yarn Configurations for GPU (Only if Nvidia GPU is used)
-
-### GPU configurations for both resourcemanager and nodemanager
-
-Add the YARN resource configuration file, named resource-types.xml:
-
- ```
- <configuration>
- <property>
- <name>yarn.resource-types</name>
- <value>yarn.io/gpu</value>
- </property>
- </configuration>
- ```
-
-#### GPU configurations for resourcemanager
-
-The scheduler used by the resourcemanager must be the capacity scheduler, and
-yarn.scheduler.capacity.resource-calculator in capacity-scheduler.xml should be
-set to DominantResourceCalculator:
-
- ```
- <configuration>
- <property>
- <name>yarn.scheduler.capacity.resource-calculator</name>
-
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
- </property>
- </configuration>
- ```
-
-#### GPU configurations for nodemanager
-
-Add configurations in yarn-site.xml
-
- ```
- <configuration>
- <property>
- <name>yarn.nodemanager.resource-plugins</name>
- <value>yarn.io/gpu</value>
- </property>
- </configuration>
- ```
-
-Add configurations in container-executor.cfg
-
- ```
- [docker]
- ...
- # Add configurations in `[docker]` part:
- # /usr/bin/nvidia-docker is the path of nvidia-docker command
- # nvidia_driver_<version> means the nvidia driver version is <version>; the
- # nvidia-smi command can be used to check the version
- docker.allowed.volume-drivers=/usr/bin/nvidia-docker
-
docker.allowed.devices=/dev/nvidiactl,/dev/nvidia-uvm,/dev/nvidia-uvm-tools,/dev/nvidia1,/dev/nvidia0
- docker.allowed.ro-mounts=nvidia_driver_<version>
-
- [gpu]
- module.enabled=true
-
- [cgroups]
- # /sys/fs/cgroup is the cgroup mount destination
- # /hadoop-yarn is the path yarn creates by default
- root=/sys/fs/cgroup
- yarn-hierarchy=/hadoop-yarn
- ```
-
-## TensorFlow Job with YARN Runtime
-
-### Run a TensorFlow job in a zipped python virtual environment
-
-Refer to build_python_virtual_env.sh in the
-${SUBMARINE_REPO_PATH}/dev-support/mini-submarine/submarine/ directory to build
-a zipped Python virtual environment, where ${SUBMARINE_REPO_PATH} is the root
-of the submarine repository. The generated zip file can be named myvenv.zip.
-
-Copy
-${SUBMARINE_REPO_PATH}/dev-support/mini-submarine/submarine/run_submarine_mnist_tony.sh
-to the server from which you submit jobs, and modify the variables
-SUBMARINE_VERSION, SUBMARINE_HADOOP_VERSION, SUBMARINE_PATH, HADOOP_CONF_PATH
-and MNIST_PATH in it according to your environment. If Kerberos is enabled,
-please delete the parameter --insecure from the command.
-
-Run a distributed TensorFlow job:
-```
-./run_submarine_mnist_tony.sh -d http://yann.lecun.com/exdb/mnist/
-```
-The -d parameter specifies the URL from which the MNIST data is downloaded.
-
-### Run a TensorFlow job in a Docker container
-
-Prepare your Docker image. An example Dockerfile is provided under
-`docker/tensorflow/mnist/Dockerfile.tony.tf.mnist.tf_1.13.1`, which you can
-refer to when building your own image.
-
-Please make sure you have _HADOOP_HOME_, _HADOOP_YARN_HOME_,
-_HADOOP_HDFS_HOME_, _HADOOP_CONF_DIR_ and _JAVA_HOME_ configured correctly. You
-can use this command to run a distributed TensorFlow job in Docker:
-
-```
-./run_submarine_mnist_tony.sh -c -d http://yann.lecun.com/exdb/mnist/
-```
-The -c parameter specifies that the job will run in a Docker environment.
-
-The -d parameter specifies the URL from which the MNIST data is downloaded.
-
-
-## Yarn Service Runtime Requirement (Deprecated)
-
-YARN native service is available since Hadoop 3.1.0, and Submarine can use it
-to submit ML jobs. However, it requires several additional components that are
-hard to enable and maintain, so the YARN service runtime is deprecated since
-Submarine 0.3.0. We recommend using YarnRuntime instead. If you still want to
-enable it, please follow these steps.
-
-### Etcd Installation
-
-Etcd is a distributed, reliable key-value store for the most critical data of
-a distributed system; here it is used for registration and discovery of the
-services running in containers. You can also choose alternatives such as
-ZooKeeper or Consul.
-
-To install Etcd on specified servers, we can run Submarine-installer/install.sh
-
-```shell
-$ ./Submarine-installer/install.sh
-# Etcd status
-systemctl status Etcd.service
-```
-
-Check Etcd cluster health
-
-```shell
-$ etcdctl cluster-health
-member 3adf2673436aa824 is healthy: got healthy result from
http://${etcd_host_ip1}:2379
-member 85ffe9aafb7745cc is healthy: got healthy result from
http://${etcd_host_ip2}:2379
-member b3d05464c356441a is healthy: got healthy result from
http://${etcd_host_ip3}:2379
-cluster is healthy
-
-$ etcdctl member list
-3adf2673436aa824: name=etcdnode3 peerURLs=http://${etcd_host_ip1}:2380
clientURLs=http://${etcd_host_ip1}:2379 isLeader=false
-85ffe9aafb7745cc: name=etcdnode2 peerURLs=http://${etcd_host_ip2}:2380
clientURLs=http://${etcd_host_ip2}:2379 isLeader=false
-b3d05464c356441a: name=etcdnode1 peerURLs=http://${etcd_host_ip3}:2380
clientURLs=http://${etcd_host_ip3}:2379 isLeader=true
-```
-
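When automating this health check, the `etcdctl cluster-health` output can be parsed directly. A minimal sketch (the sample output below is hard-coded for illustration; a real deployment would capture the `etcdctl` output instead):

```shell
# Count healthy members in `etcdctl cluster-health` output and confirm the
# cluster reports healthy overall.
health_output=$(cat <<'EOF'
member 3adf2673436aa824 is healthy: got healthy result from http://10.0.0.1:2379
member 85ffe9aafb7745cc is healthy: got healthy result from http://10.0.0.2:2379
member b3d05464c356441a is healthy: got healthy result from http://10.0.0.3:2379
cluster is healthy
EOF
)
# In a real deployment: health_output=$(etcdctl cluster-health)
healthy=$(printf '%s\n' "$health_output" | grep -c '^member .* is healthy')
printf '%s\n' "$health_output" | grep -q '^cluster is healthy' && echo "cluster healthy with $healthy members"
```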
-### Calico Installation
-
-Calico creates and manages a flat Layer 3 network, and each container is
-assigned a routable IP. We just add the steps here for your convenience.
-You can also choose alternatives such as Flannel or OVS.
-
-To install Calico on specified servers, we can run
Submarine-installer/install.sh
-
-```
-systemctl start calico-node.service
-systemctl status calico-node.service
-```
-
-#### Check Calico Network
-
-```shell
-# Run the following command to show the status of all hosts in the cluster except localhost.
-$ calicoctl node status
-Calico process is running.
-
-IPv4 BGP status
-+---------------+-------------------+-------+------------+-------------+
-| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
-+---------------+-------------------+-------+------------+-------------+
-| ${host_ip1} | node-to-node mesh | up | 2018-09-21 | Established |
-| ${host_ip2} | node-to-node mesh | up | 2018-09-21 | Established |
-| ${host_ip3} | node-to-node mesh | up | 2018-09-21 | Established |
-+---------------+-------------------+-------+------------+-------------+
-
-IPv6 BGP status
-No IPv6 peers found.
-```
-
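Similarly, the `calicoctl node status` table can be parsed when scripting the check. A sketch with the sample output hard-coded for illustration:

```shell
# Count established BGP peers in `calicoctl node status` output.
status_output=$(cat <<'EOF'
| 10.0.0.1 | node-to-node mesh | up | 2018-09-21 | Established |
| 10.0.0.2 | node-to-node mesh | up | 2018-09-21 | Established |
| 10.0.0.3 | node-to-node mesh | up | 2018-09-21 | Established |
EOF
)
# In a real deployment: status_output=$(calicoctl node status)
peers=$(printf '%s\n' "$status_output" | grep -c 'Established')
echo "$peers peers established"
```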
-Create containers to validate calico network
-
-```
-docker network create --driver calico --ipam-driver calico-ipam calico-network
-docker run --net calico-network --name workload-A -tid busybox
-docker run --net calico-network --name workload-B -tid busybox
-docker exec workload-A ping workload-B
-```
-
-### Enable calico network for docker container
-Configure yarn-site.xml to use the Calico network for Docker containers:
-
-```
-<property>
-
<name>yarn.nodemanager.runtime.linux.docker.default-container-network</name>
- <value>calico-network</value>
- </property>
- <property>
- <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
- <value>default,docker</value>
- </property>
- <property>
-
<name>yarn.nodemanager.runtime.linux.docker.allowed-container-networks</name>
- <value>host,none,bridge,calico-network</value>
- </property>
-```
-
-Add calico-network to container-executor.cfg
-
-```
-docker.allowed.networks=bridge,host,none,calico-network
-```
-
-Then restart all nodemanagers.
-
-### Start the YARN registry DNS service
-
-The YARN registry DNS server exposes existing service-discovery information via
-DNS and enables lookups of Docker container IP mappings. With it, the
-containers of an ML job know how to communicate with each other.
-
-Please specify a server on which to start the YARN registry DNS service. For
-details please refer to [Registry DNS
-Server](http://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/yarn-service/RegistryDNS.html)
-
-```
-sudo YARN_LOGFILE=registrydns.log ./yarn-daemon.sh start registrydns
-```
-
-### Run a submarine job with yarn service runtime
-
-Please refer to [Running Distributed CIFAR 10 Tensorflow
Job_With_Yarn_Service_Runtime](RunningDistributedCifar10TFJobsWithYarnService.md)
\ No newline at end of file
diff --git a/docs/helper/InstallationGuideChineseVersion.md
b/docs/helper/InstallationGuideChineseVersion.md
deleted file mode 100644
index 1542c55..0000000
--- a/docs/helper/InstallationGuideChineseVersion.md
+++ /dev/null
@@ -1,648 +0,0 @@
-# Submarine Installation Guide
-
-## Prerequisites
-
-### Operating System
-
-The operating system we use is centos-release-7-3.1611.el7.centos.x86_64 with
-kernel 3.10.0-514.el7.x86_64, which should be treated as the minimum versions.
-
-| Environment | Version |
-| ------ | ------ |
-| Operating System | centos-release-7-3.1611.el7.centos.x86_64 |
-| Kernel | 3.10.0-514.el7.x86_64 |
-
-### User & Group
-
-If these user groups and users do not exist on the operating system, they must
-be added. Some users are required by Hadoop, and some by Docker.
-
-```
-adduser hdfs
-adduser mapred
-adduser yarn
-groupadd hadoop
-usermod -aG hdfs,hadoop hdfs
-usermod -aG mapred,hadoop mapred
-usermod -aG yarn,hadoop yarn
-usermod -aG hdfs,hadoop hadoop
-groupadd docker
-usermod -aG docker yarn
-usermod -aG docker hadoop
-```
-
-### GCC Version
-
-```bash
-gcc --version
-gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
-# If gcc is not installed, run the following command to install it
-yum install gcc make gcc-c++
-```
-
-### Kernel header & devel
-
-```bash
-# Option 1:
-yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
-# Option 2:
-wget http://vault.centos.org/7.3.1611/os/x86_64/Packages/kernel-headers-3.10.0-514.el7.x86_64.rpm
-rpm -ivh kernel-headers-3.10.0-514.el7.x86_64.rpm
-```
-
-### Check the GPU Version
-
-```
-lspci | grep -i nvidia
-
-# If nothing is printed, the GPU is not supported. Example output:
-# 04:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1)
-# 82:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1)
-```
-
-
-
-### Install the NVIDIA Driver
-
-Before installing the NVIDIA driver/CUDA, make sure any previously installed
-NVIDIA driver/CUDA has been cleaned up.
-
-```
-# Uninstall CUDA:
-sudo /usr/local/cuda-10.0/bin/uninstall_cuda_10.0.pl
-
-# Uninstall the nvidia-driver:
-sudo /usr/bin/nvidia-uninstall
-```
-
-Install nvidia-detect to check the GPU model
-
-```
-yum install nvidia-detect
-# Running the command nvidia-detect -v returns:
-nvidia-detect -v
-Probing for supported NVIDIA devices...
-[10de:13bb] NVIDIA Corporation GM107GL [Quadro K620]
-This device requires the current 390.87 NVIDIA driver kmod-nvidia
-[8086:1912] Intel Corporation HD Graphics 530
-An Intel display controller was also detected
-```
-
-Note the [Quadro K620] and 390.87 in the output.
-Download [NVIDIA-Linux-x86_64-390.87.run](https://www.nvidia.com/object/linux-amd64-display-archive.html)
-
-
-Preparation before the installation
-
-```
-# This may take a long time if the system has not been updated recently
-yum -y update
-yum -y install kernel-devel
-
-yum -y install epel-release
-yum -y install dkms
-
-# Disable nouveau
-vim /etc/default/grub # add rd.driver.blacklist=nouveau nouveau.modeset=0 to GRUB_CMDLINE_LINUX
-grub2-mkconfig -o /boot/grub2/grub.cfg # regenerate the grub configuration
-vim /etc/modprobe.d/blacklist.conf # create/open the file and add: blacklist nouveau
-
-mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
-dracut /boot/initramfs-$(uname -r).img $(uname -r) # rebuild initramfs, then reboot
-reboot
-```
-
-After the reboot, confirm that nouveau is disabled
-
-```
-lsmod | grep nouveau # should return nothing
-
-# Start the installation
-sh NVIDIA-Linux-x86_64-390.87.run
-```
-
-During the installation, you will see some prompts:
-
-```
-Install NVIDIA's 32-bit compatibility libraries (Yes)
-Would you like to run the nvidia-xconfig utility to automatically update your
-X configuration file... (NO)
-```
-
-
-Finally, check the NVIDIA GPU status
-
-```
-nvidia-smi
-```
-
-Reference:
-https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
-
-
-
-### Install Docker
-
-```
-yum -y update
-yum -y install yum-utils
-yum-config-manager --add-repo https://yum.dockerproject.org/repo/main/centos/7
-yum -y update
-
-# List the available packages
-yum search --showduplicates docker-engine
-
-# Install docker 1.12.5
-yum -y --nogpgcheck install docker-engine-1.12.5*
-systemctl start docker
-
-chown hadoop:netease /var/run/docker.sock
-chown hadoop:netease /usr/bin/docker
-```
-
-Reference: https://docs.docker.com/cs-engine/1.12/
-
-### Configure Docker
-
-Create a `daemon.json` file in the `/etc/docker/` directory and add the
-following configuration. Variables such as image_registry_ip, etcd_host_ip,
-localhost_ip, yarn_dns_registry_host_ip and dns_host_ip need to be modified
-according to your environment.
-
-```
-{
- "insecure-registries": ["${image_registry_ip}:5000"],
-
"cluster-store":"etcd://${etcd_host_ip1}:2379,${etcd_host_ip2}:2379,${etcd_host_ip3}:2379",
- "cluster-advertise":"{localhost_ip}:2375",
- "dns": ["${yarn_dns_registry_host_ip}", "${dns_host_ip1}"],
- "hosts": ["tcp://{localhost_ip}:2375", "unix:///var/run/docker.sock"]
-}
-```
-
-Restart the docker daemon:
-
-```
-sudo systemctl restart docker
-```
-
-
-
-### Docker EE version
-
-```bash
-$ docker version
-
-Client:
- Version: 1.12.5
- API version: 1.24
- Go version: go1.6.4
- Git commit: 7392c3b
- Built: Fri Dec 16 02:23:59 2016
- OS/Arch: linux/amd64
-
-Server:
- Version: 1.12.5
- API version: 1.24
- Go version: go1.6.4
- Git commit: 7392c3b
- Built: Fri Dec 16 02:23:59 2016
- OS/Arch: linux/amd64
-```
-
-### Install nvidia-docker
-
-Submarine in Hadoop 3.2 uses nvidia-docker 1.0
-
-```
-wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm
-sudo rpm -i /tmp/nvidia-docker*.rpm
-# Start nvidia-docker
-sudo systemctl start nvidia-docker
-
-# Check the nvidia-docker status:
-systemctl status nvidia-docker
-
-# View the nvidia-docker logs:
-journalctl -u nvidia-docker
-
-# Check whether nvidia-docker-plugin works
-curl http://localhost:3476/v1.0/docker/cli
-```
-
-Under `/var/lib/nvidia-docker/volumes/nvidia_driver/`, create a directory named
-after the `nvidia-driver` version:
-
-```
-mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87
-# 390.87 is the nvidia driver version number
-
-mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
-mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
-
-cp /usr/bin/nvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
-cp /usr/lib64/libcuda* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
-cp /usr/lib64/libnvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
-
-# Test nvidia-smi
-nvidia-docker run --rm nvidia/cuda:9.0-devel nvidia-smi
-```
-
-Test the docker, nvidia-docker and nvidia-driver installation
-
-```
-# Test 1
-nvidia-docker run --rm nvidia/cuda nvidia-smi
-```
-
-```
-# Test 2
-nvidia-docker run -it tensorflow/tensorflow:1.9.0-gpu bash
-# Run inside the docker container
-python
-import tensorflow as tf
-tf.test.is_gpu_available()
-```
-
-How to uninstall nvidia-docker 1.0:
-https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
-
-Reference:
-https://github.com/NVIDIA/nvidia-docker/tree/1.0
-
-
-
-### TensorFlow Image
-
-To build a TensorFlow Docker image, please refer to [WriteDockerfileTF.md](WriteDockerfileTF.md)
-
-### Test the TensorFlow Environment
-
-After the Docker image is built, manually check that TensorFlow works
-correctly, to avoid problems after the job is scheduled by YARN. Run the
-following commands:
-
-```shell
-$ docker run -it ${docker_image_name} /bin/bash
-# >>> inside the container
-$ python
-$ python >> import tensorflow as tf
-$ python >> tf.__version__
-```
-
-If problems occur, check the following:
-
-1. Whether the LD_LIBRARY_PATH environment variable is set correctly
-
-   ```
-   echo $LD_LIBRARY_PATH
-   /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
-   ```
-
-2. Whether libcuda.so.1 and libcuda.so are in the paths specified by LD_LIBRARY_PATH
-
-   ```
-   ls -l /usr/local/nvidia/lib64 | grep libcuda.so
-   ```
-
-## Hadoop Installation
-
-### Install Hadoop
-First, get the Hadoop package by building from source or downloading a release
-from the [Hadoop Homepage](https://hadoop.apache.org/).
-Then follow [Hadoop Cluster
-Setup](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html)
-to install the Hadoop cluster.
-
-
-
-### Start YARN Services
-
-```
-YARN_LOGFILE=resourcemanager.log ./sbin/yarn-daemon.sh start resourcemanager
-YARN_LOGFILE=nodemanager.log ./sbin/yarn-daemon.sh start nodemanager
-YARN_LOGFILE=timeline.log ./sbin/yarn-daemon.sh start timelineserver
-YARN_LOGFILE=mr-historyserver.log ./sbin/mr-jobhistory-daemon.sh start
historyserver
-```
-
-
-### Test with a wordcount job
-
-Run the simplest wordcount job to check that YARN is installed correctly
-
-```
-./bin/hadoop jar
/home/hadoop/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar
wordcount /tmp/wordcount.txt /tmp/wordcount-output4
-```
-
-## YARN Configuration for GPU
-
-### Add GPU support to the ResourceManager and NodeManager
-
-Create resource-types.xml in the YARN configuration directory (conf or
-etc/hadoop) and add:
-
- ```
- <configuration>
- <property>
- <name>yarn.resource-types</name>
- <value>yarn.io/gpu</value>
- </property>
- </configuration>
- ```
-
-### GPU configuration for the ResourceManager
-
-The scheduler used by the ResourceManager must be the Capacity Scheduler.
-Modify the following property in capacity-scheduler.xml:
-
- ```
- <configuration>
- <property>
- <name>yarn.scheduler.capacity.resource-calculator</name>
-
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
- </property>
- </configuration>
- ```
-
-### GPU configuration for the NodeManager
-
-Add the following configuration to yarn-site.xml on the NodeManager:
-
- ```
- <configuration>
- <property>
- <name>yarn.nodemanager.resource-plugins</name>
- <value>yarn.io/gpu</value>
- </property>
- </configuration>
- ```
-
-Add the following configuration to container-executor.cfg:
-
- ```
- [docker]
- ...
- # Add the following under the existing [docker] section:
- # /usr/bin/nvidia-docker is the path of the nvidia-docker command
- # 375.26 in nvidia_driver_375.26 is the driver version; check it with nvidia-smi
- docker.allowed.volume-drivers=/usr/bin/nvidia-docker
-
docker.allowed.devices=/dev/nvidiactl,/dev/nvidia-uvm,/dev/nvidia-uvm-tools,/dev/nvidia1,/dev/nvidia0
- docker.allowed.ro-mounts=nvidia_driver_375.26
-
- [gpu]
- module.enabled=true
-
- [cgroups]
- # /sys/fs/cgroup is the cgroup mount point
- # /hadoop-yarn is the path YARN creates under the cgroup root by default
- root=/sys/fs/cgroup
- yarn-hierarchy=/hadoop-yarn
- ```
-
-## Submit a TensorFlow Job with the YARN Runtime
-
-### Run a TensorFlow job in a zipped Python virtual environment
-
-Use the build_python_virtual_env.sh script in the
-${SUBMARINE_REPO_PATH}/dev-support/mini-submarine/submarine/ directory to build
-a zipped Python virtual environment, where ${SUBMARINE_REPO_PATH} is the root
-of the submarine repository. The generated zip file can be named myvenv.zip.
-
-Copy the file
-${SUBMARINE_REPO_PATH}/dev-support/mini-submarine/submarine/run_submarine_mnist_tony.sh
-to the server from which jobs are submitted. Modify the variables
-SUBMARINE_VERSION, HADOOP_VERSION, SUBMARINE_PATH, HADOOP_CONF_PATH and
-MNIST_PATH in it according to your environment. If Kerberos is enabled, please
-delete the parameter --insecure from the command.
-
-Run a distributed TensorFlow job:
-```
-./run_submarine_mnist_tony.sh -d http://yann.lecun.com/exdb/mnist/
-```
-The -d parameter specifies the URL from which the MNIST data is downloaded.
-
-### Submit a TensorFlow job in a Docker container (TODO)
-
-
-## Yarn Service Runtime (Deprecated)
-
-Hadoop 3.1.0 provides the YARN native service feature, which Submarine can use
-to submit distributed machine learning jobs. However, YARN native service
-introduces several extra components that make deployment and maintenance
-difficult, so the Yarn Service Runtime is no longer recommended since Submarine
-0.3.0. We recommend using YarnRuntime directly, which can submit machine
-learning jobs on YARN 2.9. To enable the Yarn Service Runtime, follow the
-steps below.
-
-### Install Etcd
-
-Run the Submarine/install.sh script to install the Etcd component and its
-service auto-start script on the specified servers.
-
-```shell
-$ ./Submarine/install.sh
-# Check the Etcd service status with the following command
-systemctl status Etcd.service
-```
-
-Check the Etcd cluster health
-
-```shell
-$ etcdctl cluster-health
-member 3adf2673436aa824 is healthy: got healthy result from
http://${etcd_host_ip1}:2379
-member 85ffe9aafb7745cc is healthy: got healthy result from
http://${etcd_host_ip2}:2379
-member b3d05464c356441a is healthy: got healthy result from
http://${etcd_host_ip3}:2379
-cluster is healthy
-
-$ etcdctl member list
-3adf2673436aa824: name=etcdnode3 peerURLs=http://${etcd_host_ip1}:2380
clientURLs=http://${etcd_host_ip1}:2379 isLeader=false
-85ffe9aafb7745cc: name=etcdnode2 peerURLs=http://${etcd_host_ip2}:2380
clientURLs=http://${etcd_host_ip2}:2379 isLeader=false
-b3d05464c356441a: name=etcdnode1 peerURLs=http://${etcd_host_ip3}:2380
clientURLs=http://${etcd_host_ip3}:2379 isLeader=true
-```
-where ${etcd_host_ip*} are the IPs of the etcd servers
-
-
-### Install Calico
-
-Run the Submarine/install.sh script to install the Calico component and its
-service auto-start script on the specified servers.
-
-```
-systemctl start calico-node.service
-systemctl status calico-node.service
-```
-
-#### Check the Calico Network
-
-```shell
-# Run the following command. Note: it shows the status of the other servers in the cluster, not the local one.
-$ calicoctl node status
-Calico process is running.
-
-IPv4 BGP status
-+---------------+-------------------+-------+------------+-------------+
-| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
-+---------------+-------------------+-------+------------+-------------+
-| ${host_ip1} | node-to-node mesh | up | 2018-09-21 | Established |
-| ${host_ip2} | node-to-node mesh | up | 2018-09-21 | Established |
-| ${host_ip3} | node-to-node mesh | up | 2018-09-21 | Established |
-+---------------+-------------------+-------+------------+-------------+
-
-IPv6 BGP status
-No IPv6 peers found.
-```
-
-Create Docker containers to validate the Calico network
-
-```
-docker network create --driver calico --ipam-driver calico-ipam calico-network
-docker run --net calico-network --name workload-A -tid busybox
-docker run --net calico-network --name workload-B -tid busybox
-docker exec workload-A ping workload-B
-```
-
-### Enable the Calico network for YARN Docker containers
-In yarn-site.xml, configure the Calico network for Docker containers.
-
-```
-<property>
-
<name>yarn.nodemanager.runtime.linux.docker.default-container-network</name>
- <value>calico-network</value>
- </property>
- <property>
- <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
- <value>default,docker</value>
- </property>
- <property>
-
<name>yarn.nodemanager.runtime.linux.docker.allowed-container-networks</name>
- <value>host,none,bridge,calico-network</value>
- </property>
-```
-
-In container-executor.cfg, add calico-network to the allowed Docker networks
-
-```
-docker.allowed.networks=bridge,host,none,calico-network
-```
-
-Then restart all NodeManager nodes.
-
-
-### Start the registry DNS service
-
-The YARN registry DNS server is a DNS service implemented for service
-discovery. YARN Docker containers register with the registry DNS server, which
-exposes the mapping between container domain names and container IPs/ports.
-
-For detailed configuration and deployment of the YARN registry DNS service,
-refer to [Registry DNS
-Server](http://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/yarn-service/RegistryDNS.html)
-
-Start the registry DNS service:
-```
-sudo YARN_LOGFILE=registrydns.log ./yarn-daemon.sh start registrydns
-```
-
-### Run a submarine job
-
-To submit a submarine job with the yarn service runtime, refer to [Running
-Distributed CIFAR 10 Tensorflow
-Job_With_Yarn_Service_Runtime](RunningDistributedCifar10TFJobsWithYarnService.md)
-
-
-## Troubleshooting
-
-### Issue 1: NodeManager fails to start after an OS reboot
-
-```
-2018-09-20 18:54:39,785 ERROR
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to
bootstrap configured resource subsystems!
-org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
Unexpected: Cannot create yarn cgroup Subsystem:cpu Mount points:/proc/mounts
User:yarn Path:/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn
- at
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:425)
- at
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:377)
- at
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:98)
- at
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:87)
- at
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
- at
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)
- at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:389)
- at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
- at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
- at
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
-2018-09-20 18:54:39,789 INFO org.apache.hadoop.service.AbstractService:
Service NodeManager failed in state INITED
-```
-
-Solution: as the `root` user, grant the `yarn` user permissions on `/sys/fs/cgroup/cpu,cpuacct`
-
-```
-chown :yarn -R /sys/fs/cgroup/cpu,cpuacct
-chmod g+rwx -R /sys/fs/cgroup/cpu,cpuacct
-```
-
-When GPU is supported, permissions on the cgroup devices path are also needed
-
-```
-chown :yarn -R /sys/fs/cgroup/devices
-chmod g+rwx -R /sys/fs/cgroup/devices
-```
-
-
-### Issue 2: container-executor permission problem
-
-```
-2018-09-21 09:36:26,102 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
IOException executing command:
-java.io.IOException: Cannot run program
"/etc/yarn/sbin/Linux-amd64-64/container-executor": error=13, Permission denied
- at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
- at org.apache.hadoop.util.Shell.runCommand(Shell.java:938)
- at org.apache.hadoop.util.Shell.run(Shell.java:901)
- at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
-```
-
-The permissions of the file `/etc/yarn/sbin/Linux-amd64-64/container-executor` should be 6050
-
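This check can be scripted. A minimal sketch assuming GNU coreutils `stat` is available; the `check_mode_6050` helper is illustrative:

```shell
# Verify that a file has mode 6050 (setuid + setgid, group r-x).
check_mode_6050() {
  [ "$(stat -c '%a' "$1")" = "6050" ]
}

# EXECUTOR is the path from the error message above.
EXECUTOR=/etc/yarn/sbin/Linux-amd64-64/container-executor
# check_mode_6050 "$EXECUTOR" || echo "wrong permissions; run: chmod 6050 $EXECUTOR"
```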
-### Issue 3: View system service startup logs
-
-```
-journalctl -u docker
-```
-
-### Issue 4: Docker cannot remove a container: `device or resource busy`
-
-```bash
-$ docker rm 0bfafa146431
-Error response from daemon: Unable to remove filesystem for
0bfafa146431771f6024dcb9775ef47f170edb2f1852f71916ba44209ca6120a: remove
/app/docker/containers/0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a/shm:
device or resource busy
-```
-
-Write a `find-busy-mnt.sh` script to find the mounted files of containers in the `device or resource busy` state
-
-```bash
-#!/bin/bash
-
-# A simple script to get information about mount points and pids and their
-# mount namespaces.
-
-if [ $# -ne 1 ]; then
-  echo "Usage: $0 <devicemapper-device-id>"
-  exit 1
-fi
-
-ID=$1
-
-MOUNTS=`find /proc/*/mounts | xargs grep $ID 2>/dev/null`
-
-[ -z "$MOUNTS" ] && echo "No pids found" && exit 0
-
-printf "PID\tNAME\t\tMNTNS\n"
-echo "$MOUNTS" | while read LINE; do
-  PID=`echo $LINE | cut -d ":" -f1 | cut -d "/" -f3`
-  # Ignore self and thread-self
-  if [ "$PID" == "self" ] || [ "$PID" == "thread-self" ]; then
-    continue
-  fi
-  NAME=`ps -q $PID -o comm=`
-  MNTNS=`readlink /proc/$PID/ns/mnt`
-  printf "%s\t%s\t\t%s\n" "$PID" "$NAME" "$MNTNS"
-done
-```
-
-Find the process occupying the directory
-
-```bash
-$ chmod +x find-busy-mnt.sh
-./find-busy-mnt.sh
0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a
-# PID NAME MNTNS
-# 5007 ntpd mnt:[4026533598]
-$ kill -9 5007
-```
-
-
-### Issue 5: The command `sudo nvidia-docker run` reports an error
-
-```
-docker: Error response from daemon: create nvidia_driver_361.42:
VolumeDriver.Create: internal error, check logs for details.
-See 'docker run --help'.
-```
-
-Solution:
-
-```
-# Check whether nvidia-docker started correctly with
-$ systemctl status nvidia-docker
-$ journalctl -n -u nvidia-docker
-# Restart nvidia-docker
-systemctl stop nvidia-docker
-systemctl start nvidia-docker
-```
-
-### Issue 6: YARN fails to launch containers
-
-If the number of containers you request (PS + Worker) exceeds the total number
-of GPUs, container creation may fail, because more containers would be created
-on a single server than the number of GPUs on that machine.
diff --git a/docs/helper/MybatisGenerator.md b/docs/helper/MybatisGenerator.md
deleted file mode 100644
index 4c2d0f3..0000000
--- a/docs/helper/MybatisGenerator.md
+++ /dev/null
@@ -1,59 +0,0 @@
-<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
- http://www.apache.org/licenses/LICENSE-2.0
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
--->
-
-## Introduction to MyBatis Generator Usage
-
-## Summary
-
-[Official Website](http://mybatis.org/generator/ "link")
-
-MyBatis Generator (MBG) is a code generator for MyBatis and iBATIS.
-It will generate code for all versions of MyBatis, and versions of iBATIS
after
-version 2.2.0. It will introspect a database table (or many tables) and will
-generate artifacts that can be used to access the table(s). This lessens the
-initial nuisance of setting up objects and configuration files to interact
-with database tables. MBG seeks to make a major impact on the large percentage
-of database operations that are simple CRUD (Create, Retrieve, Update,
Delete).
-You will still need to hand code SQL and objects for join queries, or stored
procedures.
-
-## Quick Start
-
-### Add plug-in dependencies in pom.xml
-The plug-in has been added in the pom.xml of the _workbench-server_.
-
-```
-<dependency>
- <groupId>org.mybatis.generator</groupId>
- <artifactId>mybatis-generator-core</artifactId>
- <version>1.3.7</version>
-</dependency>
-```
-
-### Edit mbgConfiguration.xml
-Edit the mbgConfiguration.xml file. We need to modify the following:
-1. We need to modify the JDBC connection information, such as driverClass,
-connectionURL, userId, password.
-2. targetProject: You can specify a specific path as file storage path.
e.g./tmp.
-3. **tableName** and **domainObjectName**: List all the table to generate the
code.
-
-### Add main class
-A main class named _MybatisGeneratorMain_ has been added in the
-_workbench-server_ project under the _org.apache.submarine.database.utils_
-package.
-
-### Generate files
-Run the main method to generate the files. Taking the IntelliJ IDEA
-development tool as an example: select the MybatisGeneratorMain class and
-right-click to run the main method.
-The generated files can be found under the targetProject path configured above,
-including: the entity classes, TableNameMapper.java and TableNameMapper.xml.
diff --git a/docs/helper/QuickStart.md b/docs/helper/QuickStart.md
deleted file mode 100644
index 2987082..0000000
--- a/docs/helper/QuickStart.md
+++ /dev/null
@@ -1,280 +0,0 @@
-<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
- http://www.apache.org/licenses/LICENSE-2.0
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
--->
-
-# Quick Start Guide
-
-## Prerequisite
-
-Must:
-
-- Apache Hadoop version newer than 2.7.3
-
-Optional:
-
-- [Enable YARN Service](https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html) (only needed for YARN 3.1.x+)
-- [Enable GPU on YARN
2.10.0+](https://hadoop.apache.org/docs/r2.10.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html)
-- [Enable Docker on YARN
2.8.2+](https://hadoop.apache.org/docs/r2.8.2/hadoop-yarn/hadoop-yarn-site/DockerContainers.html)
-- [Build Docker images](WriteDockerfileTF.md)
-
-
-<br />
-
-## Submarine Configuration
-
-Since Submarine 0.2.0, two runtimes are supported for YARN: the YARN service
-runtime and Linkedin's TonY runtime. Each runtime can support the TensorFlow,
-PyTorch and MXNet frameworks. We don't need to worry about differences in
-usage, because the two runtimes implement the same interface.
-
-So before running a job, the runtime type should be picked. The runtime choice
-may vary depending on your requirements. Check the table below to choose your
-runtime.
-
-Note that if you want to quickly try Submarine on a new or existing YARN
-cluster, using the TonY runtime will help you get started more easily (0.2.0+).
-
-| Hadoop (YARN) version | Is Docker Enabled | Is GPU Enabled | Acceptable Runtime  |
-| :-------------------- | :---------------- | -------------- | ------------------- |
-| 2.7.3 ~ 2.9.x         | X                 | X              | TonY Runtime        |
-| 2.9.x ~ 3.1.0         | Y                 | Y              | TonY Runtime        |
-| 3.1.0+                | Y                 | Y              | TonY / YARN Service |
-
-For the environment setup, please check
[InstallationGuide](InstallationGuide.md) or
[InstallationGuideCN](InstallationGuideChineseVersion.md)
-
-Once the applicable runtime is chosen and the environment is ready, a
-`submarine-site.xml` needs to be created under `$HADOOP_CONF_DIR`. To use the
-TonY runtime, please set the below value in the submarine configuration.
-
-|Configuration Name | Description |
-|:---- |:---- |
-| `submarine.runtime.class` |
"org.apache.submarine.server.submitter.yarn.YarnRuntimeFactory" or
"org.apache.submarine.server.submitter.yarnservice.YarnServiceRuntimeFactory" |
-
-<br />
-
-A sample `submarine-site.xml` is here:
-```xml
-<?xml version="1.0"?>
-<configuration>
- <property>
- <name>submarine.runtime.class</name>
-
<value>org.apache.submarine.server.submitter.yarn.YarnRuntimeFactory</value>
- <!-- Alternatively, you can use:
-
<value>org.apache.submarine.server.submitter.yarnservice.YarnServiceRuntimeFactory</value>
- -->
- </property>
-</configuration>
-```
-
-For more Submarine configuration:
-
-|Configuration Name | Description |
-|:---- |:---- |
-| `submarine.localization.max-allowed-file-size-mb` | Optional. This sets a size limit on the file/directory to be localized via the `-localization` CLI option. 2 GB by default. |
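For example (a config sketch; the value is illustrative), raising the limit to 4 GB in `submarine-site.xml` would look like:

```xml
<property>
  <name>submarine.localization.max-allowed-file-size-mb</name>
  <value>4096</value>
</property>
```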
-
-
-<br />
-
-## Launch Training Job
-This section gives an idea of what the Submarine CLI looks like.
-
-Although the `job run` command looks simple, different jobs may require very different parameters.
-
-For a quick try on Mnist example with TonY runtime, check [TonY Mnist
Example](TonYRuntimeGuide.md)
-
-
-For a quick try on Cifar10 example with YARN native service runtime, check
[YARN Service Cifar10
Example](RunningDistributedCifar10TFJobsWithYarnService.md)
-
-
-<br />
-## Get job history / logs
-
-### Get Job Status from CLI
-
-```shell
-CLASSPATH=path-to/hadoop-conf:path-to/hadoop-submarine-all-${SUBMARINE_VERSION}-hadoop-${HADOOP_VERSION}.jar
\
-java org.apache.submarine.client.cli.Cli job show --name tf-job-001
-```
-
-Output looks like:
-```shell
-Job Meta Info:
- Application Id: application_1532131617202_0005
- Input Path: hdfs://default/dataset/cifar-10-data
- Checkpoint Path: hdfs://default/tmp/cifar-10-jobdir
- Run Parameters: --name tf-job-001 --docker_image <your-docker-image>
- (... all your command-line options used to run the job)
-```
-
-After that, you can run `tensorboard --logdir=<checkpoint-path>` to view the TensorBoard of the job.
-
-### Get component logs from a training job
-
-We can use `yarn logs -applicationId <applicationId>` to get logs from CLI.
-Or from YARN UI:
-
-
-
-<br />
-
-## Submarine Commandline options
-
-```text
-usage: job run
- -checkpoint_path <arg> Training output directory of the job,
- could be local or other FS directory.
- This typically includes checkpoint
- files and exported model
- -conf <arg> User specified configuration, as
- key=val pairs.
- -distribute_keytab Distribute local keytab to cluster
- machines for service authentication. If
- not specified, the pre-distributed keytab
- whose path is specified by the
- keytab parameter on cluster machines
- will be used
- -docker_image <arg> Docker image name/tag
- -env <arg> Common environment variable of
- worker/ps
- -f <arg> Config file (in YAML format)
- -framework <arg> Framework to use. Valid values are:
- tensorflow, pytorch, mxnet. The default
- framework is tensorflow.
- -h,--help Print help
- -input_path <arg> Input of the job, could be local or
- other FS directory
- -insecure Cluster is not Kerberos enabled.
- -keytab <arg> Specify keytab used by the job under
- security environment
- -localization <arg> Specify localization to make
- remote/local file/directory available
- to all containers (Docker). Argument
- format is "RemoteUri:LocalFilePath[:rw]
- " (ro permission is not supported yet)
- The RemoteUri can be a file or
- directory on local FS, HDFS, S3,
- ABFS, HTTP, etc. The LocalFilePath
- can be absolute or relative. If it's a
- relative path, it'll be under
- container's implied working directory
- but sub directory is not supported yet.
- This option can be set multiple times.
- Examples are
- -localization
- "hdfs:///user/yarn/mydir2:/opt/data"
- -localization "s3a:///a/b/myfile1:./"
- -localization
- "https:///a/b/myfile2:./myfile"
- -localization
- "/user/yarn/mydir3:/opt/mydir3"
- -localization "./mydir1:."
- -name <arg> Name of the job
- -num_ps <arg> Number of PS tasks of the job, by
- default it's 0. Can be used with
- TensorFlow or MXNet frameworks.
- -num_schedulers <arg> Number of scheduler tasks of the job.
- It should be 1 or 0, by default it's
- 0. Can only be used with MXNet
- framework.
- -num_workers <arg> Number of worker tasks of the job, by
- default it's 1. Can be used with
- TensorFlow or PyTorch or MXNet
- frameworks.
- -principal <arg> Specify principal used by the job under
- security environment
- -ps_docker_image <arg> Specify docker image for PS, when this
- is not specified, PS uses
- --docker_image as default. Can be used
- with TensorFlow or MXNet frameworks.
- -ps_launch_cmd <arg> Commandline of PS, arguments will be
- directly used to launch the PS. Can be
- used with TensorFlow or MXNet
- frameworks.
- -ps_resources <arg> Resource of each PS, for example
- memory-mb=2048,vcores=2,yarn.io/gpu=2.
- Can be used with TensorFlow or MXNet
- frameworks.
- -queue <arg> Name of queue to run the job, by
- default it uses default queue
- -quicklink <arg> Specify quicklink so YARN web UI shows
- link to given role instance and port.
- When --tensorboard is specified,
- quicklink to tensorboard instance will
- be added automatically. The format of
- quick link is: Quick_link_label=http(or
- https)://role-name:port. For example,
- if you want to link to the first worker's
- port 7070 and the quicklink text is
- Notebook_UI, you need to specify
- --quicklink
- Notebook_UI=https://master-0:7070
- -saved_model_path <arg> Model exported path (savedmodel) of the
- job, which is needed when exported
- model is not placed under
- ${checkpoint_path}. Could be local or
- other FS directory. This will be used
- to serve.
- -scheduler_docker_image <arg> Specify docker image for scheduler,
- when this is not specified, scheduler
- uses --docker_image as default. Can
- only be used with MXNet framework.
- -scheduler_launch_cmd <arg> Commandline of scheduler, arguments
- will be directly used to launch the
- scheduler. Can only be used with MXNet
- framework.
- -scheduler_resources <arg> Resource of each scheduler, for example
- memory-mb=2048,vcores=2. Can only be
- used with MXNet framework.
- -tensorboard Should we run TensorBoard for this job?
- By default it's disabled. Can only be
- used with TensorFlow framework.
- -tensorboard_docker_image <arg> Specify Tensorboard docker image. when
- this is not specified, Tensorboard uses
- --docker_image as default. Can only be
- used with TensorFlow framework.
- -tensorboard_resources <arg> Specify resources of Tensorboard, by
- default it is memory=4G,vcores=1. Can
- only be used with TensorFlow framework.
- -verbose Print verbose log for troubleshooting
- -wait_job_finish Specify when you want to wait for the
- job to finish
- -worker_docker_image <arg> Specify docker image for WORKER, when
- this is not specified, WORKER uses
- --docker_image as default. Can be used
- with TensorFlow or PyTorch or MXNet
- frameworks.
- -worker_launch_cmd <arg> Commandline of worker, arguments will
- be directly used to launch the
- worker. Can be used with TensorFlow or
- PyTorch or MXNet frameworks.
- -worker_resources <arg> Resource of each worker, for example
- memory-mb=2048,vcores=2,yarn.io/gpu=2.
- Can be used with TensorFlow or PyTorch or
- MXNet frameworks.
-```
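As an aside for illustration (this helper is not part of Submarine), options such as `-worker_resources` and `-ps_resources` take a comma-separated list of `key=value` pairs; a minimal sketch of how such a string breaks down:

```python
def parse_resources(spec):
    """Split a Submarine-style resource string, e.g.
    'memory-mb=2048,vcores=2,yarn.io/gpu=2', into a dict."""
    resources = {}
    for pair in spec.split(","):
        # partition splits on the first '=', so keys like
        # 'yarn.io/gpu' survive intact
        key, _, value = pair.partition("=")
        resources[key.strip()] = value.strip()
    return resources

print(parse_resources("memory-mb=2048,vcores=2,yarn.io/gpu=2"))
# -> {'memory-mb': '2048', 'vcores': '2', 'yarn.io/gpu': '2'}
```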
-
-#### Notes:
-When using the `localization` option to make a collection of dependency Python
-scripts available to the entry Python script in the container, you may also need
-to set the `PYTHONPATH` environment variable as below to avoid module import
-errors reported from `entry_script.py`.
-
-```shell
-... job run
- # the entry point
- --localization entry_script.py:<path>/entry_script.py
- # the dependency Python scripts of the entry point
- --localization other_scripts_dir:<path>/other_scripts_dir
- # the PYTHONPATH env to make dependency available to entry script
- --env PYTHONPATH="<path>/other_scripts_dir"
- --worker_launch_cmd "python <path>/entry_script.py ..."
-```
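As a minimal local sketch of why `PYTHONPATH` matters here (file and function names are hypothetical), the entry script can only import a dependency module once its directory is on the Python path:

```python
import os
import sys
import tempfile

# Simulate the localized dependency directory inside the container.
dep_dir = tempfile.mkdtemp()
with open(os.path.join(dep_dir, "helper.py"), "w") as f:
    f.write("def greet():\n    return 'hello from helper'\n")

# In the container, PYTHONPATH adds this directory to sys.path;
# without it, 'import helper' raises ModuleNotFoundError.
sys.path.insert(0, dep_dir)
import helper

print(helper.greet())  # -> hello from helper
```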
-
-<br />
-
-## Build From Source
-
-If you want to build the Submarine project yourself, follow the guide [here](../development/BuildFromCode.md).
diff --git a/docs/helper/RunningDistributedCifar10TFJobsWithYarnService.md
b/docs/helper/RunningDistributedCifar10TFJobsWithYarnService.md
deleted file mode 100644
index bc3957d..0000000
--- a/docs/helper/RunningDistributedCifar10TFJobsWithYarnService.md
+++ /dev/null
@@ -1,206 +0,0 @@
-<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
- http://www.apache.org/licenses/LICENSE-2.0
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
--->
-
-# Cifar10 Tensorflow Estimator Example With YARN Native Runtime
-
-## Prepare data for training
-
-CIFAR-10 is a common benchmark in machine learning for image recognition. The example below is based on the CIFAR-10 dataset.
-
-1) Checkout https://github.com/tensorflow/models/:
-```
-git clone https://github.com/tensorflow/models/
-```
-
-2) Go to `models/tutorials/image/cifar10_estimator`
-
-3) Generate data using the following command (requires TensorFlow to be installed):
-
-```
-python generate_cifar10_tfrecords.py --data-dir=cifar-10-data
-```
-
-4) Upload data to HDFS
-
-```
-hadoop fs -put cifar-10-data/ /dataset/cifar-10-data
-```
-
-**Warning:**
-
-Please note that YARN service doesn't allow multiple services with the same name, so run the following command
-```
-yarn application -destroy <service-name>
-```
-to delete services if you want to reuse the same service name.
-
-## Prepare Docker images
-
-Refer to [Write Dockerfile](WriteDockerfileTF.md) to build a Docker image or
use prebuilt one:
-
-- hadoopsubmarine/tensorflow1.13.1-hadoop3.1.2-cpu:1.0.0
-- hadoopsubmarine/tensorflow1.13.1-hadoop3.1.2-gpu:1.0.0
-
-## Run Tensorflow jobs
-
-Set submarine.runtime.class to YarnServiceRuntimeFactory in submarine-site.xml.
-```
-<property>
- <name>submarine.runtime.class</name>
-
<value>org.apache.submarine.server.submitter.yarnservice.YarnServiceRuntimeFactory</value>
- <description>RuntimeFactory for Submarine jobs</description>
- </property>
-```
-The submarine-site.xml file is located under ${SUBMARINE_HOME}/conf.
-
-### Run standalone training
-
-```
-SUBMARINE_VERSION=0.4.0-SNAPSHOT
-CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath
--glob`:${SUBMARINE_HOME}/submarine-all-${SUBMARINE_VERSION}.jar:
-${SUBMARINE_HOME}/conf: \
-java org.apache.submarine.client.cli.Cli job run \
- --name tf-job-001 --verbose --docker_image <image> \
- --input_path hdfs://default/dataset/cifar-10-data \
- --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
- --env DOCKER_HADOOP_HDFS_HOME=/hadoop-current
- --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 \
- --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator &&
python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path%
--train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2
--sync" \
- --tensorboard --tensorboard_docker_image tf-1.13.1-cpu:0.0.1
-```
-
-Explanations:
-
-- When access to HDFS is required, two environment variables must be set: DOCKER_JAVA_HOME and DOCKER_HADOOP_HDFS_HOME, so that the libhdfs libraries can be found *inside the Docker image*. We will try to eliminate this requirement in the future.
-- Docker images for the worker and TensorBoard can be specified separately. In this case TensorBoard doesn't need a GPU, so we use a CPU Docker image for it. (Same for the parameter server in the distributed example below.)
-
-### Run distributed training
-
-```
-SUBMARINE_VERSION=0.4.0-SNAPSHOT
-CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath
--glob`:${SUBMARINE_HOME}/submarine-all-${SUBMARINE_VERSION}.jar:
-${SUBMARINE_HOME}/conf: \
-java org.apache.submarine.client.cli.Cli job run \
- --name tf-job-001 --verbose --docker_image tf-1.13.1-gpu:0.0.1 \
- --input_path hdfs://default/dataset/cifar-10-data \
- --env(s) (same as standalone)
- --num_workers 2 \
- --worker_resources memory=8G,vcores=2,gpu=1 \
- --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator &&
python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path%
--train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2
--sync" \
- --ps_docker_image tf-1.13.1-cpu:0.0.1 \
- --num_ps 1 --ps_resources memory=4G,vcores=2,gpu=0 \
- --ps_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator &&
python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path%
--num-gpus=0" \
- --tensorboard --tensorboard_docker_image tf-1.13.1-cpu:0.0.1
-```
-
-Explanations:
-
-- `num_workers > 1` indicates distributed training.
-- Parameters, resources and the Docker image of the parameter server can be specified separately. In many cases the parameter server doesn't require a GPU.
-
-For the meaning of the individual parameters, see the
[QuickStart](QuickStart.md) page!
-
-*Outputs of distributed training*
-
-Sample output of master:
-```
-...
-allow_soft_placement: true
-, '_tf_random_seed': None, '_task_type': u'master', '_environment': u'cloud',
'_is_chief': True, '_cluster_spec':
<tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe77cb15050>,
'_tf_config': gpu_options {
- per_process_gpu_memory_fraction: 1.0
-}
-...
-2018-05-06 22:29:14.656022: I
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize
GrpcChannelCache for job master -> {0 -> localhost:8000}
-2018-05-06 22:29:14.656097: I
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize
GrpcChannelCache for job ps -> {0 ->
ps-0.distributed-tf.root.tensorflow.site:8000}
-2018-05-06 22:29:14.656112: I
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize
GrpcChannelCache for job worker -> {0 ->
worker-0.distributed-tf.root.tensorflow.site:8000}
-2018-05-06 22:29:14.659359: I
tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server
with target: grpc://localhost:8000
-...
-INFO:tensorflow:Restoring parameters from
hdfs://default/tmp/cifar-10-jobdir/model.ckpt-0
-INFO:tensorflow:Evaluation [1/625]
-INFO:tensorflow:Evaluation [2/625]
-INFO:tensorflow:Evaluation [3/625]
-INFO:tensorflow:Evaluation [4/625]
-INFO:tensorflow:Evaluation [5/625]
-INFO:tensorflow:Evaluation [6/625]
-...
-INFO:tensorflow:Validation (step 1): loss = 1220.6445, global_step = 1,
accuracy = 0.1
-INFO:tensorflow:loss = 6.3980675, step = 0
-INFO:tensorflow:loss = 6.3980675, learning_rate = 0.1
-INFO:tensorflow:global_step/sec: 2.34092
-INFO:tensorflow:Average examples/sec: 1931.22 (1931.22), step = 100
-INFO:tensorflow:Average examples/sec: 354.236 (38.6479), step = 110
-INFO:tensorflow:Average examples/sec: 211.096 (38.7693), step = 120
-INFO:tensorflow:Average examples/sec: 156.533 (38.1633), step = 130
-INFO:tensorflow:Average examples/sec: 128.6 (38.7372), step = 140
-INFO:tensorflow:Average examples/sec: 111.533 (39.0239), step = 150
-```
-
-Sample output of worker:
-```
-, '_tf_random_seed': None, '_task_type': u'worker', '_environment': u'cloud',
'_is_chief': False, '_cluster_spec':
<tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc2a490b050>,
'_tf_config': gpu_options {
- per_process_gpu_memory_fraction: 1.0
-}
-...
-2018-05-06 22:28:45.807936: I
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize
GrpcChannelCache for job master -> {0 ->
master-0.distributed-tf.root.tensorflow.site:8000}
-2018-05-06 22:28:45.808040: I
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize
GrpcChannelCache for job ps -> {0 ->
ps-0.distributed-tf.root.tensorflow.site:8000}
-2018-05-06 22:28:45.808064: I
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize
GrpcChannelCache for job worker -> {0 -> localhost:8000}
-2018-05-06 22:28:45.809919: I
tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server
with target: grpc://localhost:8000
-...
-INFO:tensorflow:loss = 5.319096, step = 0
-INFO:tensorflow:loss = 5.319096, learning_rate = 0.1
-INFO:tensorflow:Average examples/sec: 49.2338 (49.2338), step = 10
-INFO:tensorflow:Average examples/sec: 52.117 (55.3589), step = 20
-INFO:tensorflow:Average examples/sec: 53.2754 (55.7541), step = 30
-INFO:tensorflow:Average examples/sec: 53.8388 (55.6028), step = 40
-INFO:tensorflow:Average examples/sec: 54.1082 (55.2134), step = 50
-INFO:tensorflow:Average examples/sec: 54.3141 (55.3676), step = 60
-```
-
-Sample output of ps:
-```
-...
-, '_tf_random_seed': None, '_task_type': u'ps', '_environment': u'cloud',
'_is_chief': False, '_cluster_spec':
<tensorflow.python.training.server_lib.ClusterSpec object at 0x7f4be54dff90>,
'_tf_config': gpu_options {
- per_process_gpu_memory_fraction: 1.0
-}
-...
-2018-05-06 22:28:42.562316: I
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize
GrpcChannelCache for job master -> {0 ->
master-0.distributed-tf.root.tensorflow.site:8000}
-2018-05-06 22:28:42.562408: I
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize
GrpcChannelCache for job ps -> {0 -> localhost:8000}
-2018-05-06 22:28:42.562433: I
tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize
GrpcChannelCache for job worker -> {0 ->
worker-0.distributed-tf.root.tensorflow.site:8000}
-2018-05-06 22:28:42.564242: I
tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server
with target: grpc://localhost:8000
-```
-#### Notes:
-When using the YARN native service runtime, you can view the training history of multiple jobs from the `Tensorboard` link:
-
-
-
-## Run tensorboard to monitor your jobs
-
-```shell
-# Cleanup previous tensorboard service if needed
-
-SUBMARINE_VERSION=0.4.0-SNAPSHOT
-CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath
--glob`:${SUBMARINE_HOME}/submarine-all-${SUBMARINE_VERSION}.jar:
-${SUBMARINE_HOME}/conf: \
-java org.apache.submarine.client.cli.Cli job run \
- --name tensorboard-service \
- --verbose \
- --docker_image <your-docker-image> \
- --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
- --env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
- --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
- --worker_resources memory=2G,vcores=2 \
- --worker_launch_cmd "pwd" \
- --tensorboard
-```
diff --git a/docs/helper/RunningSingleNodeCifar10PTJobsWithYarnService.md
b/docs/helper/RunningSingleNodeCifar10PTJobsWithYarnService.md
deleted file mode 100644
index 86c1ac2..0000000
--- a/docs/helper/RunningSingleNodeCifar10PTJobsWithYarnService.md
+++ /dev/null
@@ -1,72 +0,0 @@
-<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
- http://www.apache.org/licenses/LICENSE-2.0
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
--->
-
-# Tutorial: Running a standalone Cifar10 PyTorch Estimator Example.
-
-Currently, PyTorch integration with Submarine only supports PyTorch in standalone (non-distributed) mode.
-Please also note that HDFS as a data source is not yet supported by PyTorch.
-
-## What is CIFAR-10?
-CIFAR-10 is a common benchmark in machine learning for image recognition. The example below is based on the CIFAR-10 dataset.
-
-**Warning:**
-
-Please note that YARN service doesn't allow multiple services with the same name, so run the following command
-```
-yarn application -destroy <service-name>
-```
-to delete services if you want to reuse the same service name.
-
-## Prepare Docker images
-
-Refer to [Write Dockerfile](WriteDockerfilePT.md) to build a Docker image or
use prebuilt one.
-
-## Running PyTorch jobs
-
-Set submarine.runtime.class to YarnServiceRuntimeFactory in submarine-site.xml.
-```
-<property>
- <name>submarine.runtime.class</name>
-
<value>org.apache.submarine.server.submitter.yarnservice.YarnServiceRuntimeFactory</value>
- <description>RuntimeFactory for Submarine jobs</description>
- </property>
-```
-The submarine-site.xml file is located under ${SUBMARINE_HOME}/conf.
-
-### Run standalone training
-
-```shell
-SUBMARINE_VERSION=0.4.0-SNAPSHOT
-CLASSPATH=`${HADOOP_HOME}/bin/hadoop classpath
--glob`:${SUBMARINE_HOME}/submarine-all-${SUBMARINE_VERSION}.jar:
-${SUBMARINE_HOME}/conf: \
-java org.apache.submarine.client.cli.Cli job run \
---name pytorch-job-001 \
---verbose \
---framework pytorch \
---wait_job_finish \
---docker_image pytorch-latest-gpu:0.0.1 \
---input_path hdfs://unused \
---env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre \
---env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.2 \
---env YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true \
---num_workers 1 \
---worker_resources memory=5G,vcores=2 \
---worker_launch_cmd "cd /test/ && python cifar10_tutorial.py"
-```
-
-For the meaning of the individual parameters, see the
[QuickStart](QuickStart.md) page!
-
-**Remarks:**
-Please note that the input path parameter is mandatory, but not yet used by
the PyTorch docker container.
diff --git a/docs/user-guide-home.md b/docs/user-guide-home.md
new file mode 100644
index 0000000..9e272eb
--- /dev/null
+++ b/docs/user-guide-home.md
@@ -0,0 +1,31 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# User Document of Submarine
+
+This is the index of Submarine's user documentation.
+
+## Build From Code
+
+[Build From Code Guide](./development/BuildFromCode.md)
+
+FIXME: Where's build guide for K8s? How can we make it clear?
+
+## Use Submarine on YARN
+
+Please refer to [Submarine On YARN](userdocs/yarn/README.md)
+
+## Use Submarine on K8s
+
+Please refer to [Submarine On K8s](userdocs/k8s/README.md)
diff --git a/docs/userdocs/k8s/README.md b/docs/userdocs/k8s/README.md
new file mode 100644
index 0000000..ecacdc1
--- /dev/null
+++ b/docs/userdocs/k8s/README.md
@@ -0,0 +1,41 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Running Submarine on K8s
+
+Submarine for K8s supports standalone and distributed TensorFlow and PyTorch.
+
+Submarine can run on K8s >= (FIXME, version) and supports features like GPU isolation.
+
+## Submarine on K8s guide
+
+### Prepare K8s and deploy Submarine Service
+
+[Setup Kubernetes](setup-kubernetes.md): Submarine can be deployed on any K8s environment with a matching version. If you don't have a running K8s, you can follow the steps there to set one up using [kind, Kubernetes-in-Docker](https://kind.sigs.k8s.io/) for testing purposes.
+
+After you have an up-and-running K8s, you can follow the [Deploy Submarine Services on K8s](deploy-submarine.md) guide to deploy Submarine services on K8s using a Helm chart in minutes (FIXME: is it true?).
+
+### Use Submarine
+
+#### Model training (experiment) on K8s
+
+- [Run model training using Tensorflow](run-tensorflow-on-k8s.md)
+- [Run model training using PyTorch](FIXME, add one).
+
+## References
diff --git a/docs/submarine-server/ml-frameworks/README.md
b/docs/userdocs/k8s/deploy-submarine.md
similarity index 51%
rename from docs/submarine-server/ml-frameworks/README.md
rename to docs/userdocs/k8s/deploy-submarine.md
index a2b0b82..18d3112 100644
--- a/docs/submarine-server/ml-frameworks/README.md
+++ b/docs/userdocs/k8s/deploy-submarine.md
@@ -1,4 +1,4 @@
-<!--
+<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
@@ -17,6 +17,34 @@ specific language governing permissions and limitations
under the License.
-->
-# Machine Learning Framework
-Submarine 0.3.0 and above supports the training of TensorFlow jobs on
kubernetes by tf-operator, for more info see [here](./tensorflow.md).
+# Deploy Submarine On K8s
+
+## Deploy Submarine using Helm Chart (Recommended)
+
+Submarine's Helm chart deploys not only the Submarine Server but also the TF Operator / PyTorch Operator (which are used by the Submarine Server to run TF/PyTorch jobs on K8s).
+
+### Create images
+submarine server
+```bash
+./dev-support/docker-images/submarine/build.sh
+```
+
+submarine database
+```bash
+./dev-support/docker-images/database/build.sh
+```
+
+### Install Helm
+For more info see https://helm.sh/docs/intro/install/
+
+### Deploy Submarine Server and MySQL
+You can modify settings in `./helm-charts/submarine/values.yaml`:
+```bash
+helm install submarine ./helm-charts/submarine
+```
+
+### Delete deployment
+```bash
+helm delete submarine
+```
diff --git a/docs/submarine-server/README.md
b/docs/userdocs/k8s/run-tensorflow-on-k8s.md
similarity index 98%
rename from docs/submarine-server/README.md
rename to docs/userdocs/k8s/run-tensorflow-on-k8s.md
index 05bcd89..e99a8f6 100644
--- a/docs/submarine-server/README.md
+++ b/docs/userdocs/k8s/run-tensorflow-on-k8s.md
@@ -1,4 +1,4 @@
-<!--
+<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
@@ -26,7 +26,7 @@ It now supports Tensorflow and PyTorch jobs.
- A K8s cluster
- The Docker image encapsulated with your deep learning application code
-Note that We provide a learning and production environment tutorial. For more
deployment info see [Deploy Submarine Server on
Kubernetes](./setup-kubernetes.md).
+Note that we provide a learning and production environment tutorial. For more deployment info see [Deploy Submarine Server on Kubernetes](project/github/submarine/docs/userdocs/k8s/setup-kubernetes.md).
## Training
A generic job spec was designed for the training job request; you should get familiar with the job spec before submitting a job.
@@ -114,7 +114,7 @@ or
}
```
-For more info about the spec definition see
[here](../design/submarine-server/jobspec.md).
+For more info about the spec definition see
[here](project/github/submarine/docs/design/submarine-server/jobspec.md).
## Job Operation by REST API
### Create Job
diff --git a/docs/submarine-server/setup-kubernetes.md
b/docs/userdocs/k8s/setup-kubernetes.md
similarity index 55%
rename from docs/submarine-server/setup-kubernetes.md
rename to docs/userdocs/k8s/setup-kubernetes.md
index 3eb8ed2..4aa00ba 100644
--- a/docs/submarine-server/setup-kubernetes.md
+++ b/docs/userdocs/k8s/setup-kubernetes.md
@@ -62,79 +62,3 @@ kubectl proxy
Now access Dashboard at:
> http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/
-
-### Setup Submarine
-
-#### Local
-
-##### Get package
-You can download Submarine from the releases page or build it from source.
-
-##### Configuration
-Copy the kube config into `conf/k8s/config` or modify the
`conf/submarine-site.xml`:
-```
-<property>
- <name>submarine.k8s.kube.config</name>
- <value>PATH_TO_KUBE_CONFIG</value>
-</property>
-```
-
-##### Start Submarine Server
-Running the submarine server, executing the following command:
-```
-# if built from source, run this under the target dir, e.g. submarine-dist/target/submarine-dist-0.4.0-SNAPSHOT-hadoop-2.9/submarine-dist-0.4.0-SNAPSHOT-hadoop-2.9/
-./bin/submarine-daemon.sh start getMysqlJar
-```
-
-The REST API URL is: `http://127.0.0.1:8080/api/v1/jobs`
-
-#### Deploy Tensorflow Operator
-For more info see [deploy tensorflow operator](./ml-frameworks/tensorflow.md).
-
-#### Deploy PyTorch Operator
-```bash
-cd <submarine_code_path_root>/dev-support/k8s/pytorchjob
-./deploy-pytorch-operator.sh
-
-```
-
-
-### Use Helm Chart to deploy
-
-#### Create images
-submarine server
-```bash
-./dev-support/docker-images/submarine/build.sh
-```
-
-submarine database
-```bash
-./dev-support/docker-images/database/build.sh
-```
-
-#### install helm
-For more info see https://helm.sh/docs/intro/install/
-
-#### Deploy submarine server, mysql
-You can modify some settings in ./helm-charts/submarine/values.yaml
-```bash
-helm install submarine ./helm-charts/submarine
-```
-
-#### Delete deployment
-```bash
-helm delete submarine
-```
-
-#### port-forward {host port}:{container port}
-```bash
-kubectl port-forward svc/submarine-server 8080:8080 --address 0.0.0.0
-```
-
-## Production environment
-
-### Setup Kubernetes
-For more info see https://kubernetes.io/docs/setup/#production-environment
-
-### Setup Submarine
-It will come soon.
diff --git a/docs/submarine-server/ml-frameworks/tensorflow.md
b/docs/userdocs/k8s/tensorflow.md
similarity index 100%
rename from docs/submarine-server/ml-frameworks/tensorflow.md
rename to docs/userdocs/k8s/tensorflow.md
diff --git a/docs/userdocs/yarn/Dockerfiles.md
b/docs/userdocs/yarn/Dockerfiles.md
new file mode 100644
index 0000000..1796d1e
--- /dev/null
+++ b/docs/userdocs/yarn/Dockerfiles.md
@@ -0,0 +1,22 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# Write Dockerfiles for Submarine
+
+[How to write Dockerfile for Submarine TensorFlow jobs](WriteDockerfileTF.md)
+
+[How to write Dockerfile for Submarine PyTorch jobs](WriteDockerfilePT.md)
+
+[How to write Dockerfile for Submarine MXNet jobs](WriteDockerfileMX.md)
diff --git a/docs/userdocs/yarn/README.md b/docs/userdocs/yarn/README.md
new file mode 100644
index 0000000..9e1fd42
--- /dev/null
+++ b/docs/userdocs/yarn/README.md
@@ -0,0 +1,37 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# Running Submarine on YARN
+
+Submarine for YARN supports the TensorFlow, PyTorch and MXNet frameworks,
leveraging [TonY](https://github.com/linkedin/TonY) (created by LinkedIn) to
run deep learning training jobs on YARN.
+
+Submarine also supports the GPU-on-YARN and Docker-on-YARN features.
+
+Submarine can run on Hadoop 2.7.3 or later. If the GPU-on-YARN or
Docker-on-YARN feature is needed, a newer Hadoop version is required; please
refer to the next section on which Hadoop version to choose.
+
+## Hadoop version
+
+Must:
+
+- Apache Hadoop 2.7.3 or later
+
+Optional:
+
+- When you want to use GPU-on-YARN feature with Submarine, please make sure
Hadoop is at least 2.10.0+ (or 3.1.0+), and follow [Enable GPU on YARN
2.10.0+](https://hadoop.apache.org/docs/r2.10.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html)
to enable GPU-on-YARN feature.
+- When you want to run training jobs with Docker container, please make sure
Hadoop is at least 2.8.2, and follow [Enable Docker on YARN
2.8.2+](https://hadoop.apache.org/docs/r2.8.2/hadoop-yarn/hadoop-yarn-site/DockerContainers.html)
to enable Docker-on-YARN feature.
+
+## Submarine YARN Runtime Guide
+
+The [YARN Runtime Guide](YARNRuntimeGuide.md) talks about how to use
Submarine to run jobs on YARN, with or without Docker.
diff --git a/docs/helper/TestAndTroubleshooting.md
b/docs/userdocs/yarn/TestAndTroubleshooting.md
similarity index 100%
rename from docs/helper/TestAndTroubleshooting.md
rename to docs/userdocs/yarn/TestAndTroubleshooting.md
diff --git a/docs/helper/WriteDockerfileMX.md
b/docs/userdocs/yarn/WriteDockerfileMX.md
similarity index 100%
rename from docs/helper/WriteDockerfileMX.md
rename to docs/userdocs/yarn/WriteDockerfileMX.md
diff --git a/docs/helper/WriteDockerfilePT.md
b/docs/userdocs/yarn/WriteDockerfilePT.md
similarity index 100%
rename from docs/helper/WriteDockerfilePT.md
rename to docs/userdocs/yarn/WriteDockerfilePT.md
diff --git a/docs/helper/WriteDockerfileTF.md
b/docs/userdocs/yarn/WriteDockerfileTF.md
similarity index 100%
rename from docs/helper/WriteDockerfileTF.md
rename to docs/userdocs/yarn/WriteDockerfileTF.md
diff --git a/docs/helper/TonYRuntimeGuide.md
b/docs/userdocs/yarn/YARNRuntimeGuide.md
similarity index 94%
rename from docs/helper/TonYRuntimeGuide.md
rename to docs/userdocs/yarn/YARNRuntimeGuide.md
index dd367c5..fafd545 100644
--- a/docs/helper/TonYRuntimeGuide.md
+++ b/docs/userdocs/yarn/YARNRuntimeGuide.md
@@ -13,11 +13,15 @@
limitations under the License.
-->
-# TonY Runtime Quick Start Guide
+# YARN Runtime Quick Start Guide
## Prerequisite
-Check out the [QuickStart](QuickStart.md)
+Check out the [Readme](README.md)
+
+## Build your own Docker image
+
+If you follow the documents below and want to build your own Docker image
for TensorFlow/PyTorch/MXNet, please check out [Build your own Docker
image](Dockerfiles.md) for more details.
## Launch TensorFlow Application:
@@ -276,7 +280,7 @@ You should then be able to see links and status of the jobs
from command line:
```
### With Docker
-You could refer to this [sample
Dockerfile](docker/mxnet/cifar10/Dockerfile.cifar10.mx_1.5.1) for building your
own Docker image.
+You could refer to this [sample
Dockerfile](docker/mxnet/cifar10/Dockerfile.cifar10.mx_1.5.1) for building
your own Docker image.
```
SUBMARINE_VERSION=0.4.0-SNAPSHOT
SUBMARINE_HADOOP_VERSION=3.1
@@ -298,3 +302,9 @@ java org.apache.submarine.client.cli.Cli job run --name
MXNet-job-001 \
--insecure \
--conf
tony.containers.resources=path-to/image_classification.py,path-to/submarine-all-${SUBMARINE_VERSION}-hadoop-${SUBMARINE_HADOOP_VERSION}.jar
```
+
+## Use YARN Service to run Submarine: Deprecated
+
+Historically, Submarine supported using [YARN
Service](https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html)
to submit deep learning jobs. We have stopped supporting it because YARN
Service is not actively developed by the community, and extra dependencies
such as RegistryDNS/ATS-v2 cause many setup issues.
+
+As of now, you can still use YARN Service to run Submarine, but the code
will be removed in a future release. Only TonY will be supported when using
Submarine on YARN.
diff --git
a/docs/helper/docker/mxnet/base/ubuntu-18.04/Dockerfile.cpu.mx_latest
b/docs/userdocs/yarn/docker/mxnet/base/ubuntu-18.04/Dockerfile.cpu.mx_latest
similarity index 100%
rename from docs/helper/docker/mxnet/base/ubuntu-18.04/Dockerfile.cpu.mx_latest
rename to
docs/userdocs/yarn/docker/mxnet/base/ubuntu-18.04/Dockerfile.cpu.mx_latest
diff --git
a/docs/helper/docker/mxnet/base/ubuntu-18.04/Dockerfile.gpu.mx_latest
b/docs/userdocs/yarn/docker/mxnet/base/ubuntu-18.04/Dockerfile.gpu.mx_latest
similarity index 100%
rename from docs/helper/docker/mxnet/base/ubuntu-18.04/Dockerfile.gpu.mx_latest
rename to
docs/userdocs/yarn/docker/mxnet/base/ubuntu-18.04/Dockerfile.gpu.mx_latest
diff --git a/docs/helper/docker/mxnet/build-all.sh
b/docs/userdocs/yarn/docker/mxnet/build-all.sh
similarity index 100%
rename from docs/helper/docker/mxnet/build-all.sh
rename to docs/userdocs/yarn/docker/mxnet/build-all.sh
diff --git a/docs/helper/docker/mxnet/cifar10/Dockerfile.cifar10.mx_1.5.1
b/docs/userdocs/yarn/docker/mxnet/cifar10/Dockerfile.cifar10.mx_1.5.1
similarity index 100%
rename from docs/helper/docker/mxnet/cifar10/Dockerfile.cifar10.mx_1.5.1
rename to docs/userdocs/yarn/docker/mxnet/cifar10/Dockerfile.cifar10.mx_1.5.1
diff --git
a/docs/helper/docker/pytorch/base/ubuntu-18.04/Dockerfile.gpu.pytorch_latest
b/docs/userdocs/yarn/docker/pytorch/base/ubuntu-18.04/Dockerfile.gpu.pytorch_latest
similarity index 100%
rename from
docs/helper/docker/pytorch/base/ubuntu-18.04/Dockerfile.gpu.pytorch_latest
rename to
docs/userdocs/yarn/docker/pytorch/base/ubuntu-18.04/Dockerfile.gpu.pytorch_latest
diff --git a/docs/helper/docker/pytorch/build-all.sh
b/docs/userdocs/yarn/docker/pytorch/build-all.sh
similarity index 100%
rename from docs/helper/docker/pytorch/build-all.sh
rename to docs/userdocs/yarn/docker/pytorch/build-all.sh
diff --git a/docs/helper/docker/pytorch/with-cifar10-models/cifar10_tutorial.py
b/docs/userdocs/yarn/docker/pytorch/with-cifar10-models/cifar10_tutorial.py
similarity index 100%
rename from docs/helper/docker/pytorch/with-cifar10-models/cifar10_tutorial.py
rename to
docs/userdocs/yarn/docker/pytorch/with-cifar10-models/cifar10_tutorial.py
diff --git
a/docs/helper/docker/pytorch/with-cifar10-models/ubuntu-18.04/Dockerfile.gpu.pytorch_latest
b/docs/userdocs/yarn/docker/pytorch/with-cifar10-models/ubuntu-18.04/Dockerfile.gpu.pytorch_latest
similarity index 100%
rename from
docs/helper/docker/pytorch/with-cifar10-models/ubuntu-18.04/Dockerfile.gpu.pytorch_latest
rename to
docs/userdocs/yarn/docker/pytorch/with-cifar10-models/ubuntu-18.04/Dockerfile.gpu.pytorch_latest
diff --git
a/docs/helper/docker/tensorflow/base/ubuntu-18.04/Dockerfile.cpu.tf_1.13.1
b/docs/userdocs/yarn/docker/tensorflow/base/ubuntu-18.04/Dockerfile.cpu.tf_1.13.1
similarity index 100%
rename from
docs/helper/docker/tensorflow/base/ubuntu-18.04/Dockerfile.cpu.tf_1.13.1
rename to
docs/userdocs/yarn/docker/tensorflow/base/ubuntu-18.04/Dockerfile.cpu.tf_1.13.1
diff --git
a/docs/helper/docker/tensorflow/base/ubuntu-18.04/Dockerfile.gpu.tf_1.13.1
b/docs/userdocs/yarn/docker/tensorflow/base/ubuntu-18.04/Dockerfile.gpu.tf_1.13.1
similarity index 100%
rename from
docs/helper/docker/tensorflow/base/ubuntu-18.04/Dockerfile.gpu.tf_1.13.1
rename to
docs/userdocs/yarn/docker/tensorflow/base/ubuntu-18.04/Dockerfile.gpu.tf_1.13.1
diff --git a/docs/helper/docker/tensorflow/build-all.sh
b/docs/userdocs/yarn/docker/tensorflow/build-all.sh
similarity index 100%
rename from docs/helper/docker/tensorflow/build-all.sh
rename to docs/userdocs/yarn/docker/tensorflow/build-all.sh
diff --git
a/docs/helper/docker/tensorflow/mnist/Dockerfile.tony.tf.mnist.tf_1.13.1
b/docs/userdocs/yarn/docker/tensorflow/mnist/Dockerfile.tony.tf.mnist.tf_1.13.1
similarity index 100%
rename from
docs/helper/docker/tensorflow/mnist/Dockerfile.tony.tf.mnist.tf_1.13.1
rename to
docs/userdocs/yarn/docker/tensorflow/mnist/Dockerfile.tony.tf.mnist.tf_1.13.1
diff --git
a/docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/Dockerfile.cpu.tf_1.13.1
b/docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/Dockerfile.cpu.tf_1.13.1
similarity index 100%
rename from
docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/Dockerfile.cpu.tf_1.13.1
rename to
docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/Dockerfile.cpu.tf_1.13.1
diff --git
a/docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/Dockerfile.gpu.tf_1.13.1
b/docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/Dockerfile.gpu.tf_1.13.1
similarity index 100%
rename from
docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/Dockerfile.gpu.tf_1.13.1
rename to
docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/Dockerfile.gpu.tf_1.13.1
diff --git
a/docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/README.md
b/docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/README.md
similarity index 100%
rename from
docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/README.md
rename to
docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/README.md
diff --git
a/docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10.py
b/docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10.py
similarity index 100%
rename from
docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10.py
rename to
docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10.py
diff --git
a/docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10_main.py
b/docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10_main.py
similarity index 100%
rename from
docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10_main.py
rename to
docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10_main.py
diff --git
a/docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10_model.py
b/docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10_model.py
similarity index 100%
rename from
docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10_model.py
rename to
docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10_model.py
diff --git
a/docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10_utils.py
b/docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10_utils.py
similarity index 100%
rename from
docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10_utils.py
rename to
docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/cifar10_utils.py
diff --git
a/docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/generate_cifar10_tfrecords.py
b/docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/generate_cifar10_tfrecords.py
similarity index 100%
rename from
docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/generate_cifar10_tfrecords.py
rename to
docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/generate_cifar10_tfrecords.py
diff --git
a/docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/model_base.py
b/docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/model_base.py
similarity index 100%
rename from
docs/helper/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/model_base.py
rename to
docs/userdocs/yarn/docker/tensorflow/with-cifar10-models/ubuntu-18.04/cifar10_estimator_tf_1.13.1/model_base.py
diff --git
a/docs/helper/docker/tensorflow/zeppelin-notebook-example/Dockerfile.gpu
b/docs/userdocs/yarn/docker/tensorflow/zeppelin-notebook-example/Dockerfile.gpu
similarity index 100%
rename from
docs/helper/docker/tensorflow/zeppelin-notebook-example/Dockerfile.gpu
rename to
docs/userdocs/yarn/docker/tensorflow/zeppelin-notebook-example/Dockerfile.gpu
diff --git
a/docs/helper/docker/tensorflow/zeppelin-notebook-example/run_container.sh
b/docs/userdocs/yarn/docker/tensorflow/zeppelin-notebook-example/run_container.sh
similarity index 100%
rename from
docs/helper/docker/tensorflow/zeppelin-notebook-example/run_container.sh
rename to
docs/userdocs/yarn/docker/tensorflow/zeppelin-notebook-example/run_container.sh
diff --git a/docs/helper/docker/tensorflow/zeppelin-notebook-example/shiro.ini
b/docs/userdocs/yarn/docker/tensorflow/zeppelin-notebook-example/shiro.ini
similarity index 100%
rename from docs/helper/docker/tensorflow/zeppelin-notebook-example/shiro.ini
rename to
docs/userdocs/yarn/docker/tensorflow/zeppelin-notebook-example/shiro.ini
diff --git
a/docs/helper/docker/tensorflow/zeppelin-notebook-example/zeppelin-site.xml
b/docs/userdocs/yarn/docker/tensorflow/zeppelin-notebook-example/zeppelin-site.xml
similarity index 100%
rename from
docs/helper/docker/tensorflow/zeppelin-notebook-example/zeppelin-site.xml
rename to
docs/userdocs/yarn/docker/tensorflow/zeppelin-notebook-example/zeppelin-site.xml
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]