YARN-8875. [Submarine] Add documentation for submarine installation script details. (Xun Liu via wangda)
Change-Id: I1c8d39c394e5a30f967ea514919835b951f2c124 Project: http://git-wip-us.apache.org/repos/asf/hadoop/repo Commit: http://git-wip-us.apache.org/repos/asf/hadoop/commit/ed08dd3b Tree: http://git-wip-us.apache.org/repos/asf/hadoop/tree/ed08dd3b Diff: http://git-wip-us.apache.org/repos/asf/hadoop/diff/ed08dd3b Branch: refs/heads/HDDS-4 Commit: ed08dd3b0c9cec20373e8ca4e34d6526bd759943 Parents: babd144 Author: Wangda Tan <wan...@apache.org> Authored: Tue Oct 16 13:36:09 2018 -0700 Committer: Wangda Tan <wan...@apache.org> Committed: Tue Oct 16 13:51:01 2018 -0700 ---------------------------------------------------------------------- .../src/site/markdown/HowToInstall.md | 36 +++ .../src/site/markdown/Index.md | 4 +- .../src/site/markdown/InstallationGuide.md | 205 +++------------ .../src/site/markdown/InstallationScriptCN.md | 242 ++++++++++++++++++ .../src/site/markdown/InstallationScriptEN.md | 250 +++++++++++++++++++ .../src/site/markdown/TestAndTroubleshooting.md | 165 ++++++++++++ .../resources/images/submarine-installer.gif | Bin 0 -> 546547 bytes 7 files changed, 724 insertions(+), 178 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/hadoop/blob/ed08dd3b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/HowToInstall.md ---------------------------------------------------------------------- diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/HowToInstall.md b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/HowToInstall.md new file mode 100644 index 0000000..05d87c1 --- /dev/null +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/HowToInstall.md @@ -0,0 +1,36 @@ +<!--- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance 
with the License.
+  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# How to Install Dependencies
+
+The Submarine project uses YARN Service, Docker containers, and GPUs (when GPU hardware is available and properly configured).
+
+That means, as an admin, you have to properly set up the YARN Service related dependencies, including:
+- YARN Registry DNS
+
+Docker related dependencies, including:
+- A Docker binary with the expected version.
+- A Docker network which allows Docker containers to talk to each other across different nodes.
+
+And when GPUs are to be used:
+- GPU driver.
+- Nvidia-docker.
+
+For your convenience, we provide installation documents to help you set up your environment. You can always choose to have the dependencies installed in your own way.
+
+Use the Submarine installer to install dependencies: [EN](InstallationScriptEN.html) [CN](InstallationScriptCN.html)
+
+Alternatively, you can install the dependencies manually: [EN](InstallationGuide.html) [CN](InstallationGuideChineseVersion.html)
+
+Once you have installed the dependencies, please follow the [TestAndTroubleshooting](TestAndTroubleshooting.html) guide.
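The dependency list above can be turned into a quick preflight check before starting an installation (a minimal sketch; the `check` helper and the chosen set of binaries are our own illustration, not part of the installer):

```shell
#!/bin/bash
# Sketch: verify that the Submarine-related binaries are on the PATH.
# Only GPU-equipped nodes need nvidia-docker / nvidia-smi.
check() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK: $1 found"
  else
    echo "MISSING: $1"
  fi
}

check docker
check nvidia-docker   # GPU nodes only
check nvidia-smi      # GPU nodes only
```

Run it on every node before following either installation guide; any `MISSING` line points at a dependency section below.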
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/hadoop/blob/ed08dd3b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/Index.md
----------------------------------------------------------------------
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/Index.md b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/Index.md
index 0006f6c..baeaa15 100644
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/Index.md
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/Index.md
@@ -41,6 +41,4 @@ Click below contents if you want to understand more.
 
 - [Developer guide](DeveloperGuide.html)
 
-- [Installation guide](InstallationGuide.html)
-
-- [Installation guide Chinese version](InstallationGuideChineseVersion.html)
+- [Installation guides](HowToInstall.html)

http://git-wip-us.apache.org/repos/asf/hadoop/blob/ed08dd3b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuide.md
----------------------------------------------------------------------
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuide.md b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuide.md
index d4f4269..4ef2bda 100644
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuide.md
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuide.md
@@ -16,9 +16,11 @@
 
 ## Prerequisites
 
+(Please note that all of the following prerequisites are just an example for you to install.
You can always choose to install your own version of kernel, different users, different drivers, etc.).
+
 ### Operating System
 
-The operating system and kernel versions we used are as shown in the following table, which should be minimum required versions:
+The operating system and kernel versions we have tested are as shown in the following table; these are the recommended minimum required versions.
 
 | Environment | Version |
 | ------ | ------ |
@@ -27,7 +29,7 @@ The operating system and kernel versions we used are as shown in the following t
 
 ### User & Group
 
-As there are some specific users and groups need to be created to install hadoop/docker. Please create them if they are missing.
+There are some specific users and groups recommended for installing hadoop/docker; please create them if they are missing.
 
 ```
 adduser hdfs
@@ -45,7 +47,7 @@ usermod -aG docker hadoop
 ```
 
 ### GCC Version
 
-Check the version of GCC tool
+Check the version of the GCC tool (needed to compile the kernel).
 
 ```bash
 gcc --version
@@ -64,7 +66,7 @@ wget http://vault.centos.org/7.3.1611/os/x86_64/Packages/kernel-headers-3.10.0-5
 rpm -ivh kernel-headers-3.10.0-514.el7.x86_64.rpm
 ```
 
-### GPU Servers
+### GPU Servers (Only for Nvidia GPU equipped nodes)
 
 ```
 lspci | grep -i nvidia
@@ -76,9 +78,9 @@
 
-### Nvidia Driver Installation
+### Nvidia Driver Installation (Only for Nvidia GPU equipped nodes)
 
-If nvidia driver/cuda has been installed before, They should be uninstalled firstly.
+To make a clean installation, or if you need to upgrade GPU drivers: if the nvidia driver/cuda has been installed before, it should be uninstalled first.
 
 ```
 # uninstall cuda:
@@ -96,16 +98,16 @@ yum install nvidia-detect
 nvidia-detect -v
 Probing for supported NVIDIA devices...
[10de:13bb] NVIDIA Corporation GM107GL [Quadro K620]
-This device requires the current 390.87 NVIDIA driver kmod-nvidia
+This device requires the current xyz.nm NVIDIA driver kmod-nvidia
 [8086:1912] Intel Corporation HD Graphics 530
 An Intel display controller was also detected
 ```
 
-Pay attention to `This device requires the current 390.87 NVIDIA driver kmod-nvidia`.
-Download the installer [NVIDIA-Linux-x86_64-390.87.run](https://www.nvidia.com/object/linux-amd64-display-archive.html).
+Pay attention to `This device requires the current xyz.nm NVIDIA driver kmod-nvidia`.
+Download an installer such as [NVIDIA-Linux-x86_64-390.87.run](https://www.nvidia.com/object/linux-amd64-display-archive.html).
 
-Some preparatory work for nvidia driver installation
+Some preparatory work for the nvidia driver installation. (This follows the normal Nvidia GPU driver installation; it is just put here for your convenience.)
 
 ```
 # It may take a while to update
@@ -163,6 +165,8 @@ https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
 
 ### Docker Installation
 
+We recommend using Docker version >= 1.12.5. The following steps are just for your reference; you can always choose other approaches to install Docker.
+
 ```
 yum -y update
 yum -y install yum-utils
@@ -226,9 +230,9 @@ Server:
 OS/Arch:      linux/amd64
 ```
 
-### Nvidia-docker Installation
+### Nvidia-docker Installation (Only for Nvidia GPU equipped nodes)
 
-Submarine is based on nvidia-docker 1.0 version
+Submarine depends on nvidia-docker version 1.0.
 
 ```
 wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm
@@ -285,7 +289,6 @@ Reference:
 https://github.com/NVIDIA/nvidia-docker/tree/1.0
 
-
 ### Tensorflow Image
 
 There is no need to install CUDNN and CUDA on the servers, because CUDNN and CUDA can be added in the docker images. We can get basic docker images by following WriteDockerfile.md.
@@ -367,7 +370,7 @@ ENV PATH $PATH:$JAVA_HOME/bin
 
 ### Test tensorflow in a docker container
 
 After docker image is built, we can check
-tensorflow environments before submitting a yarn job.
+Tensorflow environments before submitting a yarn job.
 
 ```shell
 $ docker run -it ${docker_image_name} /bin/bash
@@ -394,10 +397,13 @@ If there are some errors, we could check the following configuration.
 
 ### Etcd Installation
 
-To install Etcd on specified servers, we can run Submarine/install.sh
+etcd is a distributed, reliable key-value store for the most critical data of a distributed system; here it is used for the registration and discovery of services running in containers.
+You can also choose alternatives like ZooKeeper or Consul.
+
+To install Etcd on specified servers, we can run Submarine-installer/install.sh
 
 ```shell
-$ ./Submarine/install.sh
+$ ./Submarine-installer/install.sh
 # Etcd status
 systemctl status Etcd.service
 ```
@@ -421,7 +427,10 @@ b3d05464c356441a: name=etcdnode1 peerURLs=http://${etcd_host_ip3}:2380 clientURL
 
 ### Calico Installation
 
-To install Calico on specified servers, we can run Submarine/install.sh
+Calico creates and manages a flat Layer 3 network, and each container is assigned a routable IP. We just add the steps here for your convenience.
+You can also choose alternatives like Flannel or OVS.
+
+To install Calico on specified servers, we can run Submarine-installer/install.sh
 
 ```
 systemctl start calico-node.service
@@ -460,11 +469,8 @@ docker exec workload-A ping workload-B
 ```
 
 ## Hadoop Installation
 
-### Compile hadoop source code
-
-```
-mvn package -Pdist -DskipTests -Dtar
-```
+### Get Hadoop Release
+You can either get a Hadoop release binary or compile it from source code. Please follow the guides at https://hadoop.apache.org/.
 
 ### Start yarn service
 
@@ -593,10 +599,10 @@ Add configurations in container-executor.cfg
 ...
 # Add configurations in `[docker]` part:
 # /usr/bin/nvidia-docker is the path of nvidia-docker command
- # nvidia_driver_375.26 means that nvidia driver version is 375.26. nvidia-smi command can be used to check the version
+ # nvidia_driver_<version> means that the nvidia driver version is <version>. The nvidia-smi command can be used to check the version
 docker.allowed.volume-drivers=/usr/bin/nvidia-docker
 docker.allowed.devices=/dev/nvidiactl,/dev/nvidia-uvm,/dev/nvidia-uvm-tools,/dev/nvidia1,/dev/nvidia0
- docker.allowed.ro-mounts=nvidia_driver_375.26
+ docker.allowed.ro-mounts=nvidia_driver_<version>
 
 [gpu]
 module.enabled=true
@@ -607,154 +613,3 @@ Add configurations in container-executor.cfg
 root=/sys/fs/cgroup
 yarn-hierarchy=/hadoop-yarn
 ```
-
-#### Test with a tensorflow job
-
-Distributed-shell + GPU + cgroup
-
-```bash
- ./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
- --env DOCKER_JAVA_HOME=/opt/java \
- --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
- --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
- --docker_image gpu-cuda9.0-tf1.8.0-with-models \
- --input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
- --checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \
- --num_ps 0 \
- --ps_resources memory=4G,vcores=2,gpu=0 \
- --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0" \
- --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
- --num_workers 1 \
- --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"
-```
-
-
-## Issues:
-
-### Issue 1: Fail to start nodemanager after system
reboot - -``` -2018-09-20 18:54:39,785 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems! -org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Unexpected: Cannot create yarn cgroup Subsystem:cpu Mount points:/proc/mounts User:yarn Path:/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn - at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:425) - at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:377) - at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:98) - at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:87) - at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58) - at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320) - at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:389) - at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) - at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929) - at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997) -2018-09-20 18:54:39,789 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED -``` - -Solution: Grant user yarn the access to `/sys/fs/cgroup/cpu,cpuacct`, which is the subfolder of cgroup mount destination. 
-
-```
-chown :yarn -R /sys/fs/cgroup/cpu,cpuacct
-chmod g+rwx -R /sys/fs/cgroup/cpu,cpuacct
-```
-
-If GPUs are used, access to the cgroup devices folder is needed as well
-
-```
-chown :yarn -R /sys/fs/cgroup/devices
-chmod g+rwx -R /sys/fs/cgroup/devices
-```
-
-
-### Issue 2: container-executor permission denied
-
-```
-2018-09-21 09:36:26,102 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: IOException executing command:
-java.io.IOException: Cannot run program "/etc/yarn/sbin/Linux-amd64-64/container-executor": error=13, Permission denied
-        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
-        at org.apache.hadoop.util.Shell.runCommand(Shell.java:938)
-        at org.apache.hadoop.util.Shell.run(Shell.java:901)
-        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
-```
-
-Solution: The permission of `/etc/yarn/sbin/Linux-amd64-64/container-executor` should be 6050
-
-### Issue 3: How to get docker service log
-
-Solution: we can get the docker log with the following command
-
-```
-journalctl -u docker
-```
-
-### Issue 4: docker can't remove containers with errors like `device or resource busy`
-
-```bash
-$ docker rm 0bfafa146431
-Error response from daemon: Unable to remove filesystem for 0bfafa146431771f6024dcb9775ef47f170edb2f1852f71916ba44209ca6120a: remove /app/docker/containers/0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a/shm: device or resource busy
-```
-
-Solution: to find which process leads to a `device or resource busy`, we can add a shell script, named `find-busy-mnt.sh`
-
-```bash
-#!/bin/bash
-
-# A simple script to get information about mount points and pids and their
-# mount namespaces.
-
-if [ $# -ne 1 ]; then
-  echo "Usage: $0 <devicemapper-device-id>"
-  exit 1
-fi
-
-ID=$1
-
-MOUNTS=`find /proc/*/mounts | xargs grep $ID 2>/dev/null`
-
-[ -z "$MOUNTS" ] && echo "No pids found" && exit 0
-
-printf "PID\tNAME\t\tMNTNS\n"
-echo "$MOUNTS" | while read LINE; do
-  PID=`echo $LINE | cut -d ":" -f1 | cut -d "/" -f3`
-  # Ignore self and thread-self
-  if [ "$PID" == "self" ] || [ "$PID" == "thread-self" ]; then
-    continue
-  fi
-  NAME=`ps -q $PID -o comm=`
-  MNTNS=`readlink /proc/$PID/ns/mnt`
-  printf "%s\t%s\t\t%s\n" "$PID" "$NAME" "$MNTNS"
-done
-```
-
-Kill the process by pid, which is found by the script
-
-```bash
-$ chmod +x find-busy-mnt.sh
-./find-busy-mnt.sh 0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a
-# PID   NAME    MNTNS
-# 5007  ntpd    mnt:[4026533598]
-$ kill -9 5007
-```
-
-
-### Issue 5: Failed to execute `sudo nvidia-docker run`
-
-```
-docker: Error response from daemon: create nvidia_driver_361.42: VolumeDriver.Create: internal error, check logs for details.
-See 'docker run --help'.
-```
-
-Solution:
-
-```
-# check nvidia-docker status
-$ systemctl status nvidia-docker
-$ journalctl -n -u nvidia-docker
-# restart nvidia-docker
-systemctl stop nvidia-docker
-systemctl start nvidia-docker
-```
-
-### Issue 6: Yarn failed to start containers
-
-If the number of GPUs required by applications is larger than the number of GPUs in the cluster, some containers cannot be created.
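The scheduling arithmetic behind Issue 6 can be sketched as follows (the `can_schedule` helper and its messages are our own illustration, not a YARN API):

```shell
# Sketch: a job whose GPU request exceeds the cluster's total GPU count
# can never be satisfied, so its containers stay unallocated.
can_schedule() {
  local available=$1 requested=$2
  if [ "$requested" -le "$available" ]; then
    echo "schedulable"
  else
    echo "blocked: requested $requested GPUs, cluster has $available"
  fi
}

can_schedule 4 2   # a 2-GPU request on a 4-GPU cluster is fine
can_schedule 4 8   # an 8-GPU request can never be allocated
```

Before submitting a large job, compare its total GPU request against the GPU resources reported by the ResourceManager.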
- http://git-wip-us.apache.org/repos/asf/hadoop/blob/ed08dd3b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptCN.md ---------------------------------------------------------------------- diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptCN.md b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptCN.md new file mode 100644 index 0000000..8a873c4 --- /dev/null +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptCN.md @@ -0,0 +1,242 @@ +<!--- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. 
+-->
+
+# submarine installer
+
+## Introduction
+
+Before introducing the **submarine-installer** project, a few words about **Hadoop {Submarine}**: **Hadoop {Submarine}** is the machine learning framework subproject newly released in hadoop 3.2. It lets hadoop support multiple deep learning frameworks such as `Tensorflow`, `MXNet`, `Caffe` and `Spark`, providing a full-featured system framework for machine learning algorithm development, distributed model training, model management and model publishing. Combined with hadoop's intrinsic data storage and data processing capabilities, it lets data scientists mine and reveal the value of data more effectively.
+
+Since version 2.9, hadoop has let YARN support the Docker container resource scheduling mode. On top of this, **Hadoop {Submarine}** uses YARN to schedule and run distributed deep learning frameworks in the form of Docker containers.
+
+Because a distributed deep learning framework needs to run in multiple Docker containers, and the services running in those containers need to coordinate with each other to complete distributed model training and model publishing, this involves multiple system engineering problems such as `DNS`, `Docker`, `GPU`, `Network`, `graphics card` and `operating system kernel` modifications. Correctly deploying the **Hadoop {Submarine}** runtime environment is difficult and time-consuming.
+
+To reduce the difficulty of deploying the docker-related components on hadoop 2.9 and above, we developed this `submarine-installer` project dedicated to deploying the `Hadoop {Submarine}` runtime environment. It provides a one-click installation script, and can also install, uninstall, start and stop individual components step by step, explaining the main parameter configuration and considerations at each step. We also submitted a [Chinese manual](InstallationGuideChineseVersion.md) and an [English manual](InstallationGuide.md) for deploying the `Hadoop {Submarine}` runtime environment to the hadoop community, to help users deploy more easily and resolve problems in a timely manner.
+
+## Prerequisites
+
+**submarine-installer** currently only supports operating systems based on `centos-release-7-3.1611.el7.centos.x86_64` and above.
+
+## Configuration instructions
+
+Before deploying with **submarine-installer**, you can refer to the existing configuration parameters and format in the [install.conf](install.conf) file, and configure the following parameters according to your usage:
+
++ **DNS configuration**
+
+  LOCAL_DNS_HOST: server-side local DNS IP address configuration, which can be viewed from `/etc/resolv.conf`
+
+  YARN_DNS_HOST: the IP address on which the yarn dns server is started
+
++ **ETCD configuration**
+
+  Machine learning is a computation-intensive system with very high requirements on data transmission performance, so we use the ETCD network component, which has the smallest loss of network efficiency. It can support overlay networks through BGP routing, and supports tunnel mode when deployed across machine rooms.
+
+  You
need to select at least three servers as the ETCD running servers, which gives `Hadoop {Submarine}` better fault tolerance and stability.
+
+  Enter the array of IPs of the ETCD servers in the **ETCD_HOSTS** configuration item. The parameter configuration is generally like this:
+
+  ETCD_HOSTS=(hostIP1 hostIP2 hostIP3). Note that multiple hostIPs should be separated by spaces.
+
++ **DOCKER_REGISTRY configuration**
+
+  You first need to install a usable docker registry. This registry is used to store the image files of the various deep learning frameworks you need. Then configure the IP address and port of the registry. The parameter configuration is generally like this: DOCKER_REGISTRY="10.120.196.232:5000"
+
++ **DOWNLOAD_SERVER configuration**
+
+  By default, `submarine-installer` downloads all dependencies (e.g. GCC, Docker, Nvidia drivers, etc.) directly from the network, which often takes a lot of time and makes deployment impossible in environments where some servers cannot connect to the Internet. Therefore, we built an HTTP download service into `submarine-installer`: you only need to run `submarine-installer` on one server that can connect to the Internet, and it can provide dependency downloads for all other servers. You only need to follow these configurations:
+
+  1. First, configure `DOWNLOAD_SERVER_IP` as the IP address of a server that can connect to the Internet, and configure `DOWNLOAD_SERVER_PORT` as a not-too-common port.
+  2. After running the `submarine-installer/install.sh` command on the server where `DOWNLOAD_SERVER_IP` is located, select the `[start download server]` menu item in the installation interface. `submarine-installer` will download all the dependencies needed for deployment into the `submarine-installer/downloads` directory, then start an HTTP download service with the `python -m SimpleHTTPServer ${DOWNLOAD_SERVER_PORT}` command. Do not close the `submarine-installer` running on this server.
+  3. When you run the `submarine-installer/install.sh` command on the other servers and install each component in turn following the `[install component]` menu in the installation interface, the dependencies are automatically downloaded from the server where `DOWNLOAD_SERVER_IP` is located for installation and deployment.
+  4.
**DOWNLOAD_SERVER** has another use: you can manually download each dependency yourself, put the files into the `submarine-installer/downloads` directory of one of the servers, and then start `[start download server]`. This gives the whole cluster the ability to do offline installation and deployment.
+
++ **YARN_CONTAINER_EXECUTOR_PATH configuration**
+
+  How to compile YARN's container-executor: go into the `hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager` directory and execute the `mvn package -Pnative -DskipTests` command, which will compile the `./target/native/target/usr/local/bin/container-executor` file.
+
+  You need to fill the full path of the `container-executor` file into the YARN_CONTAINER_EXECUTOR_PATH configuration item.
+
++ **YARN_HIERARCHY configuration**
+
+  Please keep this the same as the `yarn.nodemanager.linux-container-executor.cgroups.hierarchy` configuration in the `yarn-site.xml` of the YARN cluster you use. If that item is not configured in `yarn-site.xml`, it defaults to `/hadoop-yarn`.
+
++ **YARN_NODEMANAGER_LOCAL_DIRS configuration**
+
+  Please keep this the same as the `yarn.nodemanager.local-dirs` configuration in the `yarn-site.xml` of the YARN cluster you use.
+
++ **YARN_NODEMANAGER_LOG_DIRS configuration**
+
+  Please keep this the same as the `yarn.nodemanager.log-dirs` configuration in the `yarn-site.xml` of the YARN cluster you use.
+
+## Usage instructions
+
+**submarine-installer** is written entirely in shell scripts and does not require deployment tools such as ansible. This avoids the program becoming unusable because server management rules differ between companies; for example, some machine rooms do not allow the ROOT user to operate remote servers directly through SHELL.
+
+The deployment process of **submarine-installer** is driven entirely by menu selections, which avoids misoperations. You can also use the menu items to install, uninstall, start and stop any single component step by step, which gives good flexibility; when a component has problems, **submarine-installer** can also be used to diagnose and repair the system.
+
+During deployment, **submarine-installer** displays log messages on the screen in three font colors:
+
++ Red font: a component installation error occurred and the deployment has been terminated.
+
++ Green font: the component installation is normal and the deployment is proceeding normally.
+
++ Blue font: you need to manually enter commands in another SHELL terminal according to the prompt, usually to modify the operating system kernel configuration; just follow the prompts in order.
+
+**Start submarine-installer**
+
+Run the `submarine-installer/install.sh` command to start. The deployment program first detects the IP addresses of the server's network cards; if the server has multiple network cards or multiple IPs configured, they are displayed as a list, and you select the IP address you actually use.
+
+**submarine-installer** menu description:
+
+
+
+## Deployment instructions
+
+The deployment process is as follows:
+
+1. Refer to the configuration instructions and configure the install.conf file according to your servers
+
+2. Package and copy the entire `submarine-installer` folder to all server nodes
+
+3. First, on the server configured as **DOWNLOAD_SERVER**:
+
+   + Run the `submarine-installer/install.sh` command
+
+   + Select the `[start download server]` menu item in the installation interface, wait for all dependencies to be downloaded, and start the HTTP service
+
+4. On the other servers that need to be deployed:
+
+   Run the `submarine-installer/install.sh` command. The main menu **[Main menu]** displays the following items:
+
+   + prepare system environment
+   + install component
+   + uninstall component
+   + start component
+   + stop component
+   + start download server
+
+5. **prepare system environment**
+
+   + **prepare operation system**
+
+     Check the operating system and version of the deployment server;
+
+   + **prepare operation system kernel**
+
+     Display the prompt with the operation commands for updating the operating system kernel; depending on your choice, the kernel version can be updated automatically;
+
+   + **prepare GCC version**
+
+     Display the current GCC version in the operating system and the prompt with the commands for updating GCC; depending on your choice, the GCC version can be updated automatically;
+
+   + **check GPU**
+
+     Check whether the server can detect the GPU card;
+
+   + **prepare user&group**
+
+     Display the prompt with the commands for adding the hadoop and docker users and user groups; you need to check, according to the prompt, whether the required users and user groups exist on the server;
+
+   + **prepare nvidia environment**
+
+     Automatically update the operating system kernel and header files, and automatically install `epel-release` and `dkms`;
+
+     Display the prompt with the commands for modifying the system kernel parameter configuration; you need to open another terminal and execute the commands in order;
+
+6.
install component
+
+   + **install etcd**
+
+     Download the etcd bin file and install it into the `/usr/bin` directory;
+
+     Generate the `etcd.service` file according to the **ETCD_HOSTS** configuration item and install it into the `/etc/systemd/system/` directory;
+
+   + **install docker**
+
+     Download the docker RPM packages and install them locally;
+
+     Generate the `daemon.json` configuration file and install it into the `/etc/docker/` directory;
+
+     Generate the `docker.service` configuration file and install it into the `/etc/systemd/system/` directory;
+
+   + **install calico network**
+
+     Download the `calico`, `calicoctl` and `calico-ipam` files and install them into the `/usr/bin` directory;
+
+     Generate the `calicoctl.cfg` configuration file and install it into the `/etc/calico/` directory;
+
+     Generate the `calico-node.service` configuration file and install it into the `/etc/systemd/system/` directory;
+
+     After the installation is complete, the calico network is automatically created according to the **CALICO_NETWORK_NAME** configuration item, and 2 Docker containers are automatically created to check whether the 2 containers can PING each other;
+
+   + **install nvidia driver**
+
+     Download the `nvidia-detect` file and detect the GPU card version on the server;
+
+     Download the Nvidia GPU driver installation package according to the GPU card version number;
+
+     Detect whether Nouveau is disabled on this server; if not, the installation stops, and you need to execute the **[prepare nvidia environment]** submenu item of the **[prepare system environment]** menu and follow the prompts;
+
+     If Nouveau has already been disabled on this server, a local installation is performed;
+
+   + **install nvidia docker**
+
+     Download the `nvidia-docker` RPM installation package and install it;
+
+     Display the prompt with the commands for checking whether `nvidia-docker` is usable; you need to open another terminal and execute the commands in order;
+
+   + **install yarn container-executor**
+
+     According to the **YARN_CONTAINER_EXECUTOR_PATH** configuration item, copy the `container-executor` file into the `/etc/yarn/sbin/Linux-amd64-64/` directory;
+
+     Generate the `container-executor.cfg` file according to the configuration and copy it into the `/etc/yarn/sbin/etc/hadoop/` directory;
+
+   + **install submarine autorun script**
+
+     Copy the `submarine.sh` file into the `/etc/rc.d/init.d/` directory;
+
+     Add `/etc/rc.d/init.d/submarine.sh` to the `/etc/rc.d/rc.local` system autorun file;
+
+7.
uninstall component
+
+   Delete the BIN files and configuration files of the specified component (details as above)
+
+   - uninstall etcd
+   - uninstall docker
+   - uninstall calico network
+   - uninstall nvidia driver
+   - uninstall nvidia docker
+   - uninstall yarn container-executor
+   - uninstall submarine autorun script
+
+8. start component
+
+   Restart the specified component (details as above)
+
+   - start etcd
+   - start docker
+   - start calico network
+
+9. stop component
+
+   Stop the specified component (details as above)
+
+   - stop etcd
+   - stop docker
+   - stop calico network
+
+10. start download server
+
+    This operation can only be executed on the server indicated by the **DOWNLOAD_SERVER_IP configuration item**;

http://git-wip-us.apache.org/repos/asf/hadoop/blob/ed08dd3b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptEN.md
----------------------------------------------------------------------
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptEN.md b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptEN.md
new file mode 100644
index 0000000..c1a408f
--- /dev/null
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptEN.md
@@ -0,0 +1,250 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+
+# submarine installer
+
+## Introduction
+
+Hadoop {Submarine} is the latest machine learning framework subproject in the Hadoop 3.2 release. It allows Hadoop to support `Tensorflow`, `MXNet`, `Caffe`, `Spark`, and other deep learning frameworks, and provides a full-featured system framework for machine learning algorithm development, distributed model training, model management, and model publishing. Combined with hadoop's intrinsic data storage and data processing capabilities, it enables data scientists to better mine and reveal the value of data.
+
+Hadoop has enabled YARN to support Docker containers since 2.x. **Hadoop {Submarine}** then uses YARN to schedule and run the distributed deep learning framework in the form of Docker containers.
+
+Since the distributed deep learning framework needs to run in multiple Docker containers, and the various services running in the containers need to coordinate with each other to complete model training and model publishing for distributed machine learning, this involves multiple system engineering problems such as `DNS`, `Docker`, `GPU`, `Network`, `graphics card`, and `operating system kernel` modifications. It is very difficult and time-consuming to properly deploy the **Hadoop {Submarine}** runtime environment.
+
+In order to reduce the difficulty of deploying components, we have developed this **submarine-installer** project to deploy the **Hadoop {Submarine}** runtime environment. It provides a one-click installation script, and can also install, uninstall, start, and stop individual components step by step, explaining the main parameter configuration and considerations for each step. We also submitted a [Chinese manual](InstallationGuideChineseVersion.md) and an [English manual](InstallationGuide.md) for the **Hadoop {Submarine}** runtime environment to the hadoop community to help users deploy more easily and find problems in a timely manner.
+
+This installer is just created for your convenience.
+You can choose to install the required libraries by yourself.
+
+## Prerequisites
+
+**submarine-installer** currently only supports operating systems based on `centos-release-7-3.1611.el7.centos.x86_64` and above.
+
+## Configuration instructions
+
+Before deploying with submarine-installer, refer to the existing configuration parameters and format in the `install.conf` file, and configure the following parameters according to your environment:
+
++ **DNS Configuration**
+
+  LOCAL_DNS_HOST: the IP address of the server's local DNS, which can be found in `/etc/resolv.conf`
+
+  YARN_DNS_HOST: the IP address on which the YARN DNS server runs
+
++ **ETCD Configuration**
+
+  Machine learning is a computation-intensive workload that requires very high data transmission performance. Therefore, we use the ETCD network component, which has the least network efficiency loss: it can support an overlay network through BGP routing, and supports tunnel mode when deployed across machine rooms.
+
+  Please note that you can choose to use different Docker networks. ETCD is not the only network solution supported by Submarine.
+
+  You need to select at least three servers to run ETCD, which makes **Hadoop {Submarine}** more fault tolerant and stable.
+
+  Enter the IP addresses of the ETCD servers in the ETCD_HOSTS configuration item. The parameter configuration generally looks like this:
+
+  ETCD_HOSTS=(hostIP1 hostIP2 hostIP3). Note that multiple host IPs should be separated by spaces.
+
++ **DOCKER_REGISTRY Configuration**
+
+  You can follow the steps below to set up your own Docker registry, but this is not a hard requirement, since you can use a pre-existing Docker registry instead.
+
+  First, install an image repository for Docker. This image repository is used to store the image files of the various deep learning frameworks you need. Then configure the IP address and port of the image repository.
+The parameter configuration generally looks like this:
+
+  DOCKER_REGISTRY="10.120.196.232:5000"
+
++ **DOWNLOAD_SERVER Configuration**
+
+  By default, **submarine-installer** downloads all dependencies (e.g. GCC, Docker, Nvidia drivers) directly from the network, which often takes a lot of time and does not work in environments where some servers cannot connect to the Internet. Therefore, we built an HTTP download service into **submarine-installer**: you only need to run **submarine-installer** on one server that can connect to the Internet, and it can provide the dependencies to all the other servers. Just follow these steps:
+
+  1. First, configure `DOWNLOAD_SERVER_IP` with the IP address of a server that can connect to the Internet, and configure `DOWNLOAD_SERVER_PORT` with a port that is not commonly used.
+  2. After running the `submarine-installer/install.sh` command on the server identified by `DOWNLOAD_SERVER_IP`, select the `[start download server]` menu item in the installation interface. **submarine-installer** downloads all the deployment dependencies into the `submarine-installer/downloads` directory on that server, then starts an HTTP download service with the `python -m SimpleHTTPServer ${DOWNLOAD_SERVER_PORT}` command. Do not close the **submarine-installer** running on this server.
+  3. When you run the `submarine-installer/install.sh` command on the other servers and install each component in turn through the `[install component]` menu of the installation interface, the dependencies are automatically downloaded from the server identified by `DOWNLOAD_SERVER_IP` for installation and deployment.
+  4. Another useful feature of **DOWNLOAD_SERVER** is that you can download the dependencies manually, put them in the `submarine-installer/downloads` directory of one of the servers, and then select `[start download server]`. This gives the cluster the ability to perform fully offline deployments.
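The offline workflow above can be sketched with a few shell variables. A minimal illustration (the IP, port, and package file name below are placeholders for this sketch, not values taken from the installer):

```shell
# Hypothetical install.conf values -- adjust to your own cluster.
DOWNLOAD_SERVER_IP="10.120.196.232"   # a server that can reach the Internet
DOWNLOAD_SERVER_PORT="19000"          # pick an uncommon port

# On that server, [start download server] roughly amounts to:
#   cd submarine-installer/downloads && python -m SimpleHTTPServer "${DOWNLOAD_SERVER_PORT}"
# (on Python 3 systems the equivalent command is: python3 -m http.server)

# The other nodes then fetch each dependency from URLs of this form:
pkg="etcd-v3.3.9-linux-amd64.tar.gz"  # example package name, for illustration only
url="http://${DOWNLOAD_SERVER_IP}:${DOWNLOAD_SERVER_PORT}/${pkg}"
echo "$url"
```

Because the download service is a plain directory listing over HTTP, anything you place in `submarine-installer/downloads` by hand is served the same way, which is what makes the offline deployment mode in step 4 work.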
+
++ **YARN_CONTAINER_EXECUTOR_PATH Configuration**
+
+  You can get the container-executor binary either from a binary release package or by building from source.
+  Fill in the full path of the container-executor file in the `YARN_CONTAINER_EXECUTOR_PATH` configuration item.
+
++ **YARN_HIERARCHY Configuration**
+
+  Use the same value as `yarn.nodemanager.linux-container-executor.cgroups.hierarchy` in the `yarn-site.xml` configuration file of the YARN cluster you are using. If this item is not configured in `yarn-site.xml`, the default is `/hadoop-yarn`.
+
++ **YARN_NODEMANAGER_LOCAL_DIRS Configuration**
+
+  Use the same value as `yarn.nodemanager.local-dirs` in the `yarn-site.xml` configuration file of the YARN cluster you are using.
+
++ **YARN_NODEMANAGER_LOG_DIRS Configuration**
+
+  Use the same value as `yarn.nodemanager.log-dirs` in the `yarn-site.xml` configuration file of the YARN cluster you are using.
+
+## Instructions for use
+
+**submarine-installer** is written entirely in shell script and does not require any deployment tool such as ansible. This avoids problems caused by the differing server management policies of different organizations; for example, some machine rooms do not allow ROOT users to operate servers directly through a remote SHELL.
+
+The deployment process of **submarine-installer** is driven entirely by menu selections, which avoids misoperations. Through the menu items you can also install, uninstall, start, and stop any individual component at each step, so it is very flexible; if a component runs into problems, **submarine-installer** can also help you diagnose and repair the system.
+
+**submarine-installer** displays log information on the screen during the deployment process.
+The log information has three font colors:
+
++ Red: a component installation error occurred and the deployment has terminated.
+
++ Green: the component installed properly and the deployment is proceeding normally.
+
++ Blue: you need to manually enter commands in another SHELL terminal according to the prompt information, generally to modify the operating system kernel configuration; just follow the prompts.
+
+**Start submarine-installer**
+
+Run the `submarine-installer/install.sh` command to start. The deployment program first detects the IP addresses of the network cards in the server; if the server has multiple network cards or multiple IP addresses configured, they are displayed as a list, and you select the IP address you actually use.
+
+**submarine-installer** menu description:
+
+
+
+## Deployment instructions
+
+The deployment process is as follows:
+
+1. Refer to the configuration instructions and configure the `install.conf` file according to your server environment.
+
+2. Copy the entire **submarine-installer** folder to all server nodes.
+
+3. First, on the server configured as **DOWNLOAD_SERVER**:
+
+   + Run the `submarine-installer/install.sh` command.
+
+   + Select the `[start download server]` menu item in the installation interface, and wait for each dependency package to be downloaded and for the HTTP service to start.
+
+4. **On the other servers that need to be deployed**
+
+   Run the `submarine-installer/install.sh` command; the main menu **[Main menu]** displays the following items:
+
+   + prepare system environment
+   + install component
+   + uninstall component
+   + start component
+   + stop component
+   + start download server
+
+5.
+**prepare system environment**
+
+   - **prepare operation system**
+
+     Check the operating system and version of the deployment server.
+
+   - **prepare operation system kernel**
+
+     Display the commands for updating the operating system kernel and, depending on your choice, update the kernel version automatically.
+
+   - **prepare GCC version**
+
+     Display the current GCC version in the operating system and the commands for updating it and, depending on your choice, update GCC automatically.
+
+   - **check GPU**
+
+     Check whether the server can detect the GPU card.
+
+   - **prepare user&group**
+
+     Display the commands for adding the hadoop and docker users and user groups. Following the prompts, check whether the required users and user groups already exist on the server.
+
+   - **prepare nvidia environment**
+
+     Automatically update the operating system kernel and header files, and automatically install `epel-release` and `dkms`.
+
+     Display the commands for modifying the system kernel parameter configuration; you need to open another terminal and execute them in order.
+
+6. **install component**
+
+   - **install etcd**
+
+     Download the etcd binary files and install them in the `/usr/bin` directory.
+
+     Generate the `etcd.service` file according to the **ETCD_HOSTS** configuration item and install it into the `/etc/systemd/system/` directory.
+
+   - **install docker**
+
+     Download Docker's RPM packages and install them locally.
+
+     Generate the `daemon.json` configuration file and install it into the `/etc/docker/` directory.
+
+     Generate the `docker.service` configuration file and install it into the `/etc/systemd/system/` directory.
+
+   - **install calico network**
+
+     Download the `calico`, `calicoctl`, and `calico-ipam` files and install them in the `/usr/bin` directory.
+
+     Generate the `calicoctl.cfg` configuration file and install it into the `/etc/calico/` directory.
+
+     Generate the `calico-node.service` configuration file and install it into the `/etc/systemd/system/` directory.
+
+     After the installation is complete, the calico network is created automatically according to the **CALICO_NETWORK_NAME** configuration item, and two Docker containers are created automatically to check whether the two containers can ping each other.
+
+   - **install nvidia driver**
+
+     Download the `nvidia-detect` tool and use it to detect the graphics card version in the server.
+
+     Download the Nvidia graphics driver installation package matching the graphics card version number.
+
+     Check whether Nouveau is disabled on this server. If it is not, the installation stops, and you need to execute the **[prepare nvidia environment]** submenu item in the **[prepare system environment]** menu and follow its prompts.
+
+     If Nouveau is already disabled on this server, the driver is installed locally.
+
+   - **install nvidia docker**
+
+     Download the nvidia-docker RPM installation package and install it.
+
+     Display the commands for checking whether nvidia-docker is available; you need to open another terminal and execute them in order.
+
+   - **install yarn container-executor**
+
+     Copy the `container-executor` file to the `/etc/yarn/sbin/Linux-amd64-64/` directory according to the **YARN_CONTAINER_EXECUTOR_PATH** configuration item.
+
+     Generate the `container-executor.cfg` file according to the configuration and copy it to the `/etc/yarn/sbin/etc/hadoop/` directory.
+
+   - **install submarine autorun script**
+
+     Copy the `submarine.sh` file to the `/etc/rc.d/init.d/` directory.
+
+     Add `/etc/rc.d/init.d/submarine.sh` to the `/etc/rc.d/rc.local` system startup file.
+
+7.
+uninstall component
+
+   Delete the binary and configuration files of the specified component (details are not repeated here):
+
+   - uninstall etcd
+   - uninstall docker
+   - uninstall calico network
+   - uninstall nvidia driver
+   - uninstall nvidia docker
+   - uninstall yarn container-executor
+   - uninstall submarine autorun script
+
+8. start component
+
+   Restart the specified component (not repeated here):
+
+   - start etcd
+   - start docker
+   - start calico network
+
+9. stop component
+
+   Stop the specified component (not repeated here):
+
+   - stop etcd
+   - stop docker
+   - stop calico network
+
+10. start download server
+
+    This operation can only be performed on the server specified by the **DOWNLOAD_SERVER_IP** configuration item.
+

http://git-wip-us.apache.org/repos/asf/hadoop/blob/ed08dd3b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/TestAndTroubleshooting.md
----------------------------------------------------------------------
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/TestAndTroubleshooting.md b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/TestAndTroubleshooting.md
new file mode 100644
index 0000000..3acf81a
--- /dev/null
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/TestAndTroubleshooting.md
@@ -0,0 +1,165 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
See accompanying LICENSE file. +--> + +#### Test with a tensorflow job + +Distributed-shell + GPU + cgroup + +```bash + ./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \ + --env DOCKER_JAVA_HOME=/opt/java \ + --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \ + --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \ + --worker_docker_image gpu-cuda9.0-tf1.8.0-with-models \ + --ps_docker_image dockerfile-cpu-tf1.8.0-with-models \ + --input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \ + --checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \ + --num_ps 0 \ + --ps_resources memory=4G,vcores=2,gpu=0 \ + --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0" \ + --worker_resources memory=4G,vcores=2,gpu=1 --verbose \ + --num_workers 1 \ + --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" +``` + + + +## Issues: + +### Issue 1: Fail to start nodemanager after system reboot + +``` +2018-09-20 18:54:39,785 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems! 
+org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Unexpected: Cannot create yarn cgroup Subsystem:cpu Mount points:/proc/mounts User:yarn Path:/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn + at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:425) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:377) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:98) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:87) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58) + at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320) + at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:389) + at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) + at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929) + at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997) +2018-09-20 18:54:39,789 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED +``` + +Solution: Grant user yarn the access to `/sys/fs/cgroup/cpu,cpuacct`, which is the subfolder of cgroup mount destination. 
+
+```
+chown :yarn -R /sys/fs/cgroup/cpu,cpuacct
+chmod g+rwx -R /sys/fs/cgroup/cpu,cpuacct
+```
+
+If GPUs are used, access to the cgroup devices folder is needed as well:
+
+```
+chown :yarn -R /sys/fs/cgroup/devices
+chmod g+rwx -R /sys/fs/cgroup/devices
+```
+
+
+### Issue 2: container-executor permission denied
+
+```
+2018-09-21 09:36:26,102 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: IOException executing command:
+java.io.IOException: Cannot run program "/etc/yarn/sbin/Linux-amd64-64/container-executor": error=13, Permission denied
+        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
+        at org.apache.hadoop.util.Shell.runCommand(Shell.java:938)
+        at org.apache.hadoop.util.Shell.run(Shell.java:901)
+        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
+```
+
+Solution: the permissions of `/etc/yarn/sbin/Linux-amd64-64/container-executor` should be 6050.
+
+### Issue 3: How to get the Docker service log
+
+Solution: we can get the Docker daemon log with the following command:
+
+```
+journalctl -u docker
+```
+
+### Issue 4: Docker can't remove containers, with errors like `device or resource busy`
+
+```bash
+$ docker rm 0bfafa146431
+Error response from daemon: Unable to remove filesystem for 0bfafa146431771f6024dcb9775ef47f170edb2f1852f71916ba44209ca6120a: remove /app/docker/containers/0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a/shm: device or resource busy
+```
+
+Solution: to find which process causes the `device or resource busy`, add the following shell script, named `find-busy-mnt.sh`:
+
+```bash
+#!/bin/bash
+
+# A simple script to get information about mount points and pids and their
+# mount namespaces.
+
+if [ $# -ne 1 ]; then
+  echo "Usage: $0 <devicemapper-device-id>"
+  exit 1
+fi
+
+ID=$1
+
+MOUNTS=`find /proc/*/mounts | xargs grep $ID 2>/dev/null`
+
+[ -z "$MOUNTS" ] && echo "No pids found" && exit 0
+
+printf "PID\tNAME\t\tMNTNS\n"
+echo "$MOUNTS" | while read LINE; do
+  PID=`echo $LINE | cut -d ":" -f1 | cut -d "/" -f3`
+  # Ignore self and thread-self
+  if [ "$PID" == "self" ] || [ "$PID" == "thread-self" ]; then
+    continue
+  fi
+  NAME=`ps -q $PID -o comm=`
+  MNTNS=`readlink /proc/$PID/ns/mnt`
+  printf "%s\t%s\t\t%s\n" "$PID" "$NAME" "$MNTNS"
+done
+```
+
+Kill the process by the pid found by the script:
+
+```bash
+$ chmod +x find-busy-mnt.sh
+./find-busy-mnt.sh 0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a
+# PID   NAME            MNTNS
+# 5007  ntpd            mnt:[4026533598]
+$ kill -9 5007
+```
+
+
+### Issue 5: Failed to execute `sudo nvidia-docker run`
+
+```
+docker: Error response from daemon: create nvidia_driver_361.42: VolumeDriver.Create: internal error, check logs for details.
+See 'docker run --help'.
+```
+
+Solution:
+
+```
+# check nvidia-docker status
+$ systemctl status nvidia-docker
+$ journalctl -n -u nvidia-docker
+# restart nvidia-docker
+$ systemctl stop nvidia-docker
+$ systemctl start nvidia-docker
+```
+
+### Issue 6: YARN failed to start containers
+
+If the number of GPUs required by applications is larger than the number of GPUs in the cluster, some containers cannot be created.
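As a quick pre-submission sanity check, you can compare the GPUs a job will request against the cluster's total. A sketch with made-up numbers (in practice, read the cluster total from the ResourceManager UI or REST API; the figures below are placeholders):

```shell
# Placeholder figures -- substitute values from your own cluster and job spec.
cluster_gpus=4        # total GPUs registered with YARN across all NodeManagers
num_workers=2         # from --num_workers
gpus_per_worker=1     # from --worker_resources memory=...,vcores=...,gpu=1

requested=$((num_workers * gpus_per_worker))

if [ "$requested" -gt "$cluster_gpus" ]; then
  echo "over-allocated: requested=$requested available=$cluster_gpus"
else
  echo "ok: requested=$requested available=$cluster_gpus"
fi
```

If the check reports over-allocation, reduce `--num_workers` or the `gpu=` value in `--worker_resources` before submitting, rather than waiting for containers to fail to allocate.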
+
http://git-wip-us.apache.org/repos/asf/hadoop/blob/ed08dd3b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/resources/images/submarine-installer.gif
----------------------------------------------------------------------
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/resources/images/submarine-installer.gif b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/resources/images/submarine-installer.gif
new file mode 100644
index 0000000..56b3b69
Binary files /dev/null and b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/resources/images/submarine-installer.gif differ