Repository: hadoop Updated Branches: refs/heads/branch-3.2 f6a73d181 -> 8e8b74872
YARN-8852. Add documentation for submarine installation details. (Zac Zhou via wangda) Change-Id: If5681d1ef37ff5dc916735eeef15a6120173d653 (cherry picked from commit a23ea68b9747eae9b176f908bb04b76d30fe3795) Project: http://git-wip-us.apache.org/repos/asf/hadoop/repo Commit: http://git-wip-us.apache.org/repos/asf/hadoop/commit/8e8b7487 Tree: http://git-wip-us.apache.org/repos/asf/hadoop/tree/8e8b7487 Diff: http://git-wip-us.apache.org/repos/asf/hadoop/diff/8e8b7487 Branch: refs/heads/branch-3.2 Commit: 8e8b74872498077a1c3568b5f55b4e215669bad4 Parents: f6a73d1 Author: Wangda Tan <[email protected]> Authored: Tue Oct 9 10:18:00 2018 -0700 Committer: Wangda Tan <[email protected]> Committed: Tue Oct 9 10:19:16 2018 -0700 ---------------------------------------------------------------------- .../src/site/markdown/Index.md | 4 + .../src/site/markdown/InstallationGuide.md | 760 +++++++++++++++++++ .../markdown/InstallationGuideChineseVersion.md | 757 ++++++++++++++++++ 3 files changed, 1521 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/hadoop/blob/8e8b7487/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/Index.md ---------------------------------------------------------------------- diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/Index.md b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/Index.md index 0b78a87..0006f6c 100644 --- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/Index.md +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/Index.md @@ -40,3 +40,7 @@ Click below contents if you want to understand more. 
- [How to write Dockerfile for Submarine jobs](WriteDockerfile.html) - [Developer guide](DeveloperGuide.html)
+
+- [Installation guide](InstallationGuide.html)
+
+- [Installation guide Chinese version](InstallationGuideChineseVersion.html)

http://git-wip-us.apache.org/repos/asf/hadoop/blob/8e8b7487/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuide.md
----------------------------------------------------------------------
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuide.md b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuide.md
new file mode 100644
index 0000000..d4f4269
--- /dev/null
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuide.md
@@ -0,0 +1,760 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# Submarine Installation Guide
+
+## Prerequisites
+
+### Operating System
+
+The operating system and kernel versions we used are shown in the following table; treat them as the minimum required versions:
+
+| Environment | Version |
+| ------ | ------ |
+| Operating System | centos-release-7-3.1611.el7.centos.x86_64 |
+| Kernel | 3.10.0-514.el7.x86_64 |
+
+### User & Group
+
+Some specific users and groups need to be created to install hadoop/docker.
Please create them if they are missing.
+
+```
+adduser hdfs
+adduser mapred
+adduser yarn
+addgroup hadoop
+usermod -aG hdfs,hadoop hdfs
+usermod -aG mapred,hadoop mapred
+usermod -aG yarn,hadoop yarn
+usermod -aG hdfs,hadoop hadoop
+groupadd docker
+usermod -aG docker yarn
+usermod -aG docker hadoop
+```
+
+### GCC Version
+
+Check the version of the GCC tool:
+
+```bash
+gcc --version
+gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
+# install if needed
+yum install gcc make g++
+```
+
+### Kernel header & Kernel devel
+
+```bash
+# Approach 1:
+yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
+# Approach 2:
+wget http://vault.centos.org/7.3.1611/os/x86_64/Packages/kernel-headers-3.10.0-514.el7.x86_64.rpm
+rpm -ivh kernel-headers-3.10.0-514.el7.x86_64.rpm
+```
+
+### GPU Servers
+
+```
+lspci | grep -i nvidia
+
+# If the server has GPUs, you will see output like this:
+04:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1)
+82:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1)
+```
+
+
+
+### Nvidia Driver Installation
+
+If an nvidia driver/cuda has been installed before, it should be uninstalled first.
+
+```
+# uninstall cuda:
+sudo /usr/local/cuda-10.0/bin/uninstall_cuda_10.0.pl
+
+# uninstall nvidia-driver:
+sudo /usr/bin/nvidia-uninstall
+```
+
+To check the GPU version, install nvidia-detect:
+
+```
+yum install nvidia-detect
+# run 'nvidia-detect -v' to get the required nvidia driver version:
+nvidia-detect -v
+Probing for supported NVIDIA devices...
+[10de:13bb] NVIDIA Corporation GM107GL [Quadro K620]
+This device requires the current 390.87 NVIDIA driver kmod-nvidia
+[8086:1912] Intel Corporation HD Graphics 530
+An Intel display controller was also detected
+```
+
+Pay attention to `This device requires the current 390.87 NVIDIA driver kmod-nvidia`.
+Download the installer [NVIDIA-Linux-x86_64-390.87.run](https://www.nvidia.com/object/linux-amd64-display-archive.html).
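If you script the installation, the required driver version can also be pulled out of the `nvidia-detect` output instead of being copied by hand. A minimal sketch, using the sample line shown above (the exact wording of the `nvidia-detect` output is an assumption and may vary between releases):

```shell
# Extract the driver version from an nvidia-detect line such as:
#   "This device requires the current 390.87 NVIDIA driver kmod-nvidia"
# On a real host, replace the printf with: nvidia-detect -v
printf 'This device requires the current 390.87 NVIDIA driver kmod-nvidia\n' \
  | grep -oE '[0-9]+\.[0-9]+'
```

The extracted string (here `390.87`) can then be used to name the installer to download and the driver volume folders created later in this guide.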
+
+
+Some preparatory work before the nvidia driver installation:
+
+```
+# It may take a while to update
+yum -y update
+yum -y install kernel-devel
+
+yum -y install epel-release
+yum -y install dkms
+
+# Disable nouveau
+vim /etc/default/grub
+# Add the following configuration to the "GRUB_CMDLINE_LINUX" part
+rd.driver.blacklist=nouveau nouveau.modeset=0
+
+# Generate configuration
+grub2-mkconfig -o /boot/grub2/grub.cfg
+
+vim /etc/modprobe.d/blacklist.conf
+# Add configuration:
+blacklist nouveau
+
+mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
+dracut /boot/initramfs-$(uname -r).img $(uname -r)
+reboot
+```
+
+Check whether nouveau is disabled:
+
+```
+lsmod | grep nouveau # should return nothing
+
+# install nvidia driver
+sh NVIDIA-Linux-x86_64-390.87.run
+```
+
+Some options during the installation:
+
+```
+Install NVIDIA's 32-bit compatibility libraries (Yes)
+centos Install NVIDIA's 32-bit compatibility libraries (Yes)
+Would you like to run the nvidia-xconfig utility to automatically update your X configuration file... (NO)
+```
+
+
+Check the nvidia driver installation:
+
+```
+nvidia-smi
+```
+
+Reference:
+https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
+
+
+
+### Docker Installation
+
+```
+yum -y update
+yum -y install yum-utils
+yum-config-manager --add-repo https://yum.dockerproject.org/repo/main/centos/7
+yum -y update
+
+# Show available packages
+yum search --showduplicates docker-engine
+
+# Install docker 1.12.5
+yum -y --nogpgcheck install docker-engine-1.12.5*
+systemctl start docker
+
+chown hadoop:netease /var/run/docker.sock
+chown hadoop:netease /usr/bin/docker
+```
+
+Reference: https://docs.docker.com/cs-engine/1.12/
+
+### Docker Configuration
+
+Add a file named daemon.json under /etc/docker/. Replace the variables image_registry_ip, etcd_host_ip, localhost_ip, yarn_dns_registry_host_ip and dns_host_ip with the specific IPs of your environment.
+
+```
+{
+    "insecure-registries": ["${image_registry_ip}:5000"],
+    "cluster-store":"etcd://${etcd_host_ip1}:2379,${etcd_host_ip2}:2379,${etcd_host_ip3}:2379",
+    "cluster-advertise":"{localhost_ip}:2375",
+    "dns": ["${yarn_dns_registry_host_ip}", "${dns_host_ip1}"],
+    "hosts": ["tcp://{localhost_ip}:2375", "unix:///var/run/docker.sock"]
+}
+```
+
+Restart the docker daemon:
+
+```
+sudo systemctl restart docker
+```
+
+
+
+### Docker EE version
+
+```bash
+$ docker version
+
+Client:
+ Version:      1.12.5
+ API version:  1.24
+ Go version:   go1.6.4
+ Git commit:   7392c3b
+ Built:        Fri Dec 16 02:23:59 2016
+ OS/Arch:      linux/amd64
+
+Server:
+ Version:      1.12.5
+ API version:  1.24
+ Go version:   go1.6.4
+ Git commit:   7392c3b
+ Built:        Fri Dec 16 02:23:59 2016
+ OS/Arch:      linux/amd64
+```
+
+### Nvidia-docker Installation
+
+Submarine is based on nvidia-docker 1.0:
+
+```
+wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm
+sudo rpm -i /tmp/nvidia-docker*.rpm
+# Start nvidia-docker
+sudo systemctl start nvidia-docker
+
+# Check nvidia-docker status:
+systemctl status nvidia-docker
+
+# Check nvidia-docker log:
+journalctl -u nvidia-docker
+
+# Test nvidia-docker-plugin
+curl http://localhost:3476/v1.0/docker/cli
+```
+
+According to the `nvidia-driver` version, create folders under `/var/lib/nvidia-docker/volumes/nvidia_driver/`:
+
+```
+mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87
+# 390.87 is the nvidia driver version
+
+mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
+mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
+
+cp /usr/bin/nvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
+cp /usr/lib64/libcuda* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
+cp /usr/lib64/libnvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
+
+# Test with nvidia-smi
+nvidia-docker run --rm nvidia/cuda:9.0-devel nvidia-smi
+```
+
+Test docker, nvidia-docker,
and nvidia-driver installation:
+
+```
+# Test 1
+nvidia-docker run --rm nvidia/cuda nvidia-smi
+```
+
+```
+# Test 2
+nvidia-docker run -it tensorflow/tensorflow:1.9.0-gpu bash
+# In the docker container
+python
+import tensorflow as tf
+tf.test.is_gpu_available()
+```
+
+[The way to uninstall nvidia-docker 1.0](https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0))
+
+Reference:
+https://github.com/NVIDIA/nvidia-docker/tree/1.0
+
+
+
+### Tensorflow Image
+
+There is no need to install CUDNN and CUDA on the servers, because CUDNN and CUDA can be added to the docker images. We can get basic docker images by following WriteDockerfile.md.
+
+
+The basic Dockerfile doesn't support kerberos security. If you need kerberos, you can write a Dockerfile like this:
+
+
+```shell
+FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
+
+# Pick up some TF dependencies
+RUN apt-get update && apt-get install -y --allow-downgrades --no-install-recommends \
+        build-essential \
+        cuda-command-line-tools-9-0 \
+        cuda-cublas-9-0 \
+        cuda-cufft-9-0 \
+        cuda-curand-9-0 \
+        cuda-cusolver-9-0 \
+        cuda-cusparse-9-0 \
+        curl \
+        libcudnn7=7.0.5.15-1+cuda9.0 \
+        libfreetype6-dev \
+        libpng12-dev \
+        libzmq3-dev \
+        pkg-config \
+        python \
+        python-dev \
+        rsync \
+        software-properties-common \
+        unzip \
+        && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+RUN export DEBIAN_FRONTEND=noninteractive && apt-get update && apt-get install -yq krb5-user libpam-krb5 && apt-get clean
+
+RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
+    python get-pip.py && \
+    rm get-pip.py
+
+RUN pip --no-cache-dir install \
+        Pillow \
+        h5py \
+        ipykernel \
+        jupyter \
+        matplotlib \
+        numpy \
+        pandas \
+        scipy \
+        sklearn \
+        && \
+    python -m ipykernel.kernelspec
+
+# Install TensorFlow GPU version.
+RUN pip --no-cache-dir install \
+    http://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.8.0-cp27-none-linux_x86_64.whl
+RUN apt-get update && apt-get install git -y
+
+RUN apt-get update && apt-get install -y openjdk-8-jdk wget
+# Download hadoop-3.1.1.tar.gz
+RUN wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
+RUN tar zxf hadoop-3.1.1.tar.gz
+RUN mv hadoop-3.1.1 hadoop-3.1.0
+
+# Download a jdk which supports kerberos
+RUN wget -qO jdk8.tar.gz 'http://${kerberos_jdk_url}/jdk-8u152-linux-x64.tar.gz'
+RUN tar xzf jdk8.tar.gz -C /opt
+RUN mv /opt/jdk* /opt/java
+RUN rm jdk8.tar.gz
+RUN update-alternatives --install /usr/bin/java java /opt/java/bin/java 100
+RUN update-alternatives --install /usr/bin/javac javac /opt/java/bin/javac 100
+
+ENV JAVA_HOME /opt/java
+ENV PATH $PATH:$JAVA_HOME/bin
+```
+
+
+### Test tensorflow in a docker container
+
+After the docker image is built, we can check the tensorflow environment before submitting a yarn job.
+
+```shell
+$ docker run -it ${docker_image_name} /bin/bash
+# >>> In the docker container
+$ python
+$ python >> import tensorflow as tf
+$ python >> tf.__version__
+```
+
+If there are errors, check the following configurations:
+
+1. The LD_LIBRARY_PATH environment variable
+
+   ```
+   echo $LD_LIBRARY_PATH
+   /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+   ```
+
+2.
The location of libcuda.so.1 and libcuda.so
+
+   ```
+   ls -l /usr/local/nvidia/lib64 | grep libcuda.so
+   ```
+
+### Etcd Installation
+
+To install Etcd on the specified servers, run Submarine/install.sh:
+
+```shell
+$ ./Submarine/install.sh
+# Etcd status
+systemctl status Etcd.service
+```
+
+Check the Etcd cluster health:
+
+```shell
+$ etcdctl cluster-health
+member 3adf2673436aa824 is healthy: got healthy result from http://${etcd_host_ip1}:2379
+member 85ffe9aafb7745cc is healthy: got healthy result from http://${etcd_host_ip2}:2379
+member b3d05464c356441a is healthy: got healthy result from http://${etcd_host_ip3}:2379
+cluster is healthy
+
+$ etcdctl member list
+3adf2673436aa824: name=etcdnode3 peerURLs=http://${etcd_host_ip1}:2380 clientURLs=http://${etcd_host_ip1}:2379 isLeader=false
+85ffe9aafb7745cc: name=etcdnode2 peerURLs=http://${etcd_host_ip2}:2380 clientURLs=http://${etcd_host_ip2}:2379 isLeader=false
+b3d05464c356441a: name=etcdnode1 peerURLs=http://${etcd_host_ip3}:2380 clientURLs=http://${etcd_host_ip3}:2379 isLeader=true
+```
+
+
+
+### Calico Installation
+
+To install Calico on the specified servers, run Submarine/install.sh:
+
+```
+systemctl start calico-node.service
+systemctl status calico-node.service
+```
+
+#### Check Calico Network
+
+```shell
+# Show the status of all hosts in the cluster except localhost.
+$ calicoctl node status
+Calico process is running.
+
+IPv4 BGP status
++---------------+-------------------+-------+------------+-------------+
+| PEER ADDRESS  |     PEER TYPE     | STATE |   SINCE    |    INFO     |
++---------------+-------------------+-------+------------+-------------+
+| ${host_ip1}   | node-to-node mesh | up    | 2018-09-21 | Established |
+| ${host_ip2}   | node-to-node mesh | up    | 2018-09-21 | Established |
+| ${host_ip3}   | node-to-node mesh | up    | 2018-09-21 | Established |
++---------------+-------------------+-------+------------+-------------+
+
+IPv6 BGP status
+No IPv6 peers found.
+```
+
+Create containers to validate the calico network:
+
+```
+docker network create --driver calico --ipam-driver calico-ipam calico-network
+docker run --net calico-network --name workload-A -tid busybox
+docker run --net calico-network --name workload-B -tid busybox
+docker exec workload-A ping workload-B
+```
+
+
+## Hadoop Installation
+
+### Compile the hadoop source code
+
+```
+mvn package -Pdist -DskipTests -Dtar
+```
+
+
+### Start the yarn services
+
+```
+YARN_LOGFILE=resourcemanager.log ./sbin/yarn-daemon.sh start resourcemanager
+YARN_LOGFILE=nodemanager.log ./sbin/yarn-daemon.sh start nodemanager
+YARN_LOGFILE=timeline.log ./sbin/yarn-daemon.sh start timelineserver
+YARN_LOGFILE=mr-historyserver.log ./sbin/mr-jobhistory-daemon.sh start historyserver
+```
+
+### Start the yarn registry dns service
+
+```
+sudo YARN_LOGFILE=registrydns.log ./yarn-daemon.sh start registrydns
+```
+
+### Test with an MR wordcount job
+
+```
+./bin/hadoop jar /home/hadoop/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar wordcount /tmp/wordcount.txt /tmp/wordcount-output4
+```
+
+
+
+## Tensorflow Job with CPU
+
+### Standalone Mode
+
+#### Clean up apps with the same name
+
+Suppose we want to submit a tensorflow job named standalone-tf. First destroy any application with the same name and clean up its historical job directories.
+ +```bash +./bin/yarn app -destroy standalone-tf +./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir +``` +where ${dfs_name_service} is the hdfs name service you use + +#### Run a standalone tensorflow job + +```bash +./bin/yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \ + --env DOCKER_JAVA_HOME=/opt/java \ + --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name standalone-tf \ + --docker_image dockerfile-cpu-tf1.8.0-with-models \ + --input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \ + --checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-checkpoint \ + --worker_resources memory=4G,vcores=2 --verbose \ + --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --num-gpus=0" +``` + +### Distributed Mode + +#### Clean up apps with the same name + +```bash +./bin/yarn app -destroy distributed-tf +./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir +``` + +#### Run a distributed tensorflow job + +```bash +./bin/yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \ + --env DOCKER_JAVA_HOME=/opt/java \ + --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf \ + --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \ + --docker_image dockerfile-cpu-tf1.8.0-with-models \ + --input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \ + --checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \ + --worker_resources memory=4G,vcores=2 --verbose \ + --num_ps 1 \ + --ps_resources memory=4G,vcores=2 \ + --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0" \ + 
--num_workers 4 \
+  --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=0"
+```
+
+
+## Tensorflow Job with GPU
+
+### GPU configurations for both resourcemanager and nodemanager
+
+Add the yarn resource configuration file named resource-types.xml:
+
+   ```
+   <configuration>
+     <property>
+       <name>yarn.resource-types</name>
+       <value>yarn.io/gpu</value>
+     </property>
+   </configuration>
+   ```
+
+#### GPU configurations for resourcemanager
+
+The scheduler used by the resourcemanager must be the capacity scheduler, and yarn.scheduler.capacity.resource-calculator in capacity-scheduler.xml should be DominantResourceCalculator:
+
+   ```
+   <configuration>
+     <property>
+       <name>yarn.scheduler.capacity.resource-calculator</name>
+       <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
+     </property>
+   </configuration>
+   ```
+
+#### GPU configurations for nodemanager
+
+Add the following configuration in yarn-site.xml:
+
+   ```
+   <configuration>
+     <property>
+       <name>yarn.nodemanager.resource-plugins</name>
+       <value>yarn.io/gpu</value>
+     </property>
+   </configuration>
+   ```
+
+Add the following configuration in container-executor.cfg:
+
+   ```
+   [docker]
+   ...
+   # Add configurations in the `[docker]` part:
+   # /usr/bin/nvidia-docker is the path of the nvidia-docker command
+   # nvidia_driver_375.26 means that the nvidia driver version is 375.26. The
nvidia-smi command can be used to check the version + docker.allowed.volume-drivers=/usr/bin/nvidia-docker + docker.allowed.devices=/dev/nvidiactl,/dev/nvidia-uvm,/dev/nvidia-uvm-tools,/dev/nvidia1,/dev/nvidia0 + docker.allowed.ro-mounts=nvidia_driver_375.26 + + [gpu] + module.enabled=true + + [cgroups] + # /sys/fs/cgroup is the cgroup mount destination + # /hadoop-yarn is the path yarn creates by default + root=/sys/fs/cgroup + yarn-hierarchy=/hadoop-yarn + ``` + +#### Test with a tensorflow job + +Distributed-shell + GPU + cgroup + +```bash + ./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \ + --env DOCKER_JAVA_HOME=/opt/java \ + --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \ + --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \ + --docker_image gpu-cuda9.0-tf1.8.0-with-models \ + --input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \ + --checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \ + --num_ps 0 \ + --ps_resources memory=4G,vcores=2,gpu=0 \ + --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0" \ + --worker_resources memory=4G,vcores=2,gpu=1 --verbose \ + --num_workers 1 \ + --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" +``` + + + +## Issues: + +### Issue 1: Fail to start nodemanager after system reboot + +``` +2018-09-20 18:54:39,785 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems! 
+org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Unexpected: Cannot create yarn cgroup Subsystem:cpu Mount points:/proc/mounts User:yarn Path:/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn + at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:425) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:377) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:98) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:87) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58) + at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320) + at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:389) + at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) + at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929) + at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997) +2018-09-20 18:54:39,789 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED +``` + +Solution: Grant user yarn the access to `/sys/fs/cgroup/cpu,cpuacct`, which is the subfolder of cgroup mount destination. 
+
+```
+chown :yarn -R /sys/fs/cgroup/cpu,cpuacct
+chmod g+rwx -R /sys/fs/cgroup/cpu,cpuacct
+```
+
+If GPUs are used, access to the cgroup devices folder is needed as well:
+
+```
+chown :yarn -R /sys/fs/cgroup/devices
+chmod g+rwx -R /sys/fs/cgroup/devices
+```
+
+
+### Issue 2: container-executor permission denied
+
+```
+2018-09-21 09:36:26,102 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: IOException executing command:
+java.io.IOException: Cannot run program "/etc/yarn/sbin/Linux-amd64-64/container-executor": error=13, Permission denied
+        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
+        at org.apache.hadoop.util.Shell.runCommand(Shell.java:938)
+        at org.apache.hadoop.util.Shell.run(Shell.java:901)
+        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
+```
+
+Solution: The permission of `/etc/yarn/sbin/Linux-amd64-64/container-executor` should be 6050.
+
+### Issue 3: How to get the docker service log
+
+Solution: We can get the docker log with the following command:
+
+```
+journalctl -u docker
+```
+
+### Issue 4: Docker can't remove containers, with errors like `device or resource busy`
+
+```bash
+$ docker rm 0bfafa146431
+Error response from daemon: Unable to remove filesystem for 0bfafa146431771f6024dcb9775ef47f170edb2f1852f71916ba44209ca6120a: remove /app/docker/containers/0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a/shm: device or resource busy
+```
+
+Solution: To find which process causes the `device or resource busy` error, add a shell script named `find-busy-mnt.sh`:
+
+```bash
+#!/bin/bash
+
+# A simple script to get information about mount points and pids and their
+# mount namespaces.
+
+if [ $# -ne 1 ];then
+echo "Usage: $0 <devicemapper-device-id>"
+exit 1
+fi
+
+ID=$1
+
+MOUNTS=`find /proc/*/mounts | xargs grep $ID 2>/dev/null`
+
+[ -z "$MOUNTS" ] && echo "No pids found" && exit 0
+
+printf "PID\tNAME\t\tMNTNS\n"
+echo "$MOUNTS" | while read LINE; do
+PID=`echo $LINE | cut -d ":" -f1 | cut -d "/" -f3`
+# Ignore self and thread-self
+if [ "$PID" == "self" ] || [ "$PID" == "thread-self" ]; then
+  continue
+fi
+NAME=`ps -q $PID -o comm=`
+MNTNS=`readlink /proc/$PID/ns/mnt`
+printf "%s\t%s\t\t%s\n" "$PID" "$NAME" "$MNTNS"
+done
+```
+
+Kill the process whose pid is found by the script:
+
+```bash
+$ chmod +x find-busy-mnt.sh
+./find-busy-mnt.sh 0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a
+# PID   NAME            MNTNS
+# 5007  ntpd            mnt:[4026533598]
+$ kill -9 5007
+```
+
+
+### Issue 5: Failed to execute `sudo nvidia-docker run`
+
+```
+docker: Error response from daemon: create nvidia_driver_361.42: VolumeDriver.Create: internal error, check logs for details.
+See 'docker run --help'.
+```
+
+Solution:
+
+```
+# check nvidia-docker status
+$ systemctl status nvidia-docker
+$ journalctl -n -u nvidia-docker
+# restart nvidia-docker
+systemctl stop nvidia-docker
+systemctl start nvidia-docker
+```
+
+### Issue 6: Yarn failed to start containers
+
+If the number of GPUs required by applications is larger than the number of GPUs in the cluster, some containers cannot be created.
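One way to diagnose this is to count how many GPUs each nodemanager host actually exposes before sizing `--worker_resources`: `nvidia-smi -L` prints one line per GPU, so counting lines gives the per-node GPU total. The sketch below runs the same counting pipeline against sample `nvidia-smi -L` output (the sample lines are illustrative, not real device output), so the pipeline itself can be checked on a machine without GPUs:

```shell
# nvidia-smi -L prints one line per GPU; wc -l therefore counts GPUs.
# On a GPU node run:  nvidia-smi -L | wc -l
# Here we pipe sample output so the counting step itself can be verified:
printf '%s\n' \
  'GPU 0: Quadro K620 (UUID: GPU-aaaa)' \
  'GPU 1: Quadro K620 (UUID: GPU-bbbb)' \
  | wc -l
```

Summing this count across all nodemanager hosts gives the cluster GPU total; requests in `--worker_resources`/`--ps_resources` must stay within it.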
+ http://git-wip-us.apache.org/repos/asf/hadoop/blob/8e8b7487/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuideChineseVersion.md ---------------------------------------------------------------------- diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuideChineseVersion.md b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuideChineseVersion.md new file mode 100644 index 0000000..471b8fc --- /dev/null +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuideChineseVersion.md @@ -0,0 +1,757 @@ +<!--- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. 
+--> + +# Submarine å®è£ 说æ + +## Prerequisites + +### æä½ç³»ç» + +æä»¬ä½¿ç¨çæä½ç³»ç»çæ¬æ¯ centos-release-7-3.1611.el7.centos.x86_64, å æ ¸çæ¬æ¯ 3.10.0-514.el7.x86_64 ï¼åºè¯¥æ¯æä½çæ¬äºã + +| Enviroment | Verion | +| ------ | ------ | +| Operating System | centos-release-7-3.1611.el7.centos.x86_64 | +| Kernal | 3.10.0-514.el7.x86_64 | + +### User & Group + +妿æä½ç³»ç»ä¸æ²¡æè¿äºç¨æ·ç»åç¨æ·ï¼å¿ 须添å ãä¸é¨åç¨æ·æ¯ hadoop è¿è¡éè¦ï¼ä¸é¨åç¨æ·æ¯ docker è¿è¡éè¦ã + +``` +adduser hdfs +adduser mapred +adduser yarn +addgroup hadoop +usermod -aG hdfs,hadoop hdfs +usermod -aG mapred,hadoop mapred +usermod -aG yarn,hadoop yarn +usermod -aG hdfs,hadoop hadoop +groupadd docker +usermod -aG docker yarn +usermod -aG docker hadoop +``` + +### GCC çæ¬ + +```bash +gcc --version +gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11) +# å¦ææ²¡æå®è£ 请æ§è¡ä»¥ä¸å½ä»¤è¿è¡å®è£ +yum install gcc make g++ +``` + +### Kernel header & devel + +```bash +# æ¹æ³ä¸ï¼ +yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) +# æ¹æ³äºï¼ +wget http://vault.centos.org/7.3.1611/os/x86_64/Packages/kernel-headers-3.10.0-514.el7.x86_64.rpm +rpm -ivh kernel-headers-3.10.0-514.el7.x86_64.rpm +``` + +### æ£æ¥ GPU çæ¬ + +``` +lspci | grep -i nvidia + +# 妿ä»ä¹é½æ²¡è¾åºï¼å°±è¯´ææ¾å¡ä¸å¯¹ï¼ä»¥ä¸æ¯æçè¾åºï¼ +# 04:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1) +# 82:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1) +``` + + + +### å®è£ nvidia é©±å¨ + +å®è£ nvidia driver/cudaè¦ç¡®ä¿å·²å®è£ çnvidia driver/cudaå·²è¢«æ¸ ç + +``` +# å¸è½½cudaï¼ +sudo /usr/local/cuda-10.0/bin/uninstall_cuda_10.0.pl + +# å¸è½½nvidia-driverï¼ +sudo /usr/bin/nvidia-uninstall +``` + +å®è£ nvidia-detectï¼ç¨äºæ£æ¥æ¾å¡çæ¬ + +``` +yum install nvidia-detect +# è¿è¡å½ä»¤ nvidia-detect -v è¿åç»æï¼ +nvidia-detect -v +Probing for supported NVIDIA devices... 
+[10de:13bb] NVIDIA Corporation GM107GL [Quadro K620] +This device requires the current 390.87 NVIDIA driver kmod-nvidia +[8086:1912] Intel Corporation HD Graphics 530 +An Intel display controller was also detected +``` + +注æè¿éçä¿¡æ¯ [Quadro K620] å390.87ã +ä¸è½½ [NVIDIA-Linux-x86_64-390.87.run](https://www.nvidia.com/object/linux-amd64-display-archive.html) + + +å®è£ åçä¸ç³»ååå¤å·¥ä½ + +``` +# è¥ç³»ç»å¾ä¹ æ²¡æ´æ°ï¼è¿å¥å¯è½èæ¶è¾é¿ +yum -y update +yum -y install kernel-devel + +yum -y install epel-release +yum -y install dkms + +# ç¦ç¨nouveau +vim /etc/default/grub #å¨âGRUB_CMDLINE_LINUXâ䏿·»å å 容 rd.driver.blacklist=nouveau nouveau.modeset=0 +grub2-mkconfig -o /boot/grub2/grub.cfg # çæé ç½® +vim /etc/modprobe.d/blacklist.conf # æå¼ï¼æ°å»ºï¼æä»¶ï¼æ·»å å 容blacklist nouveau + +mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img +dracut /boot/initramfs-$(uname -r).img $(uname -r) # æ´æ°é ç½®ï¼å¹¶éå¯ +reboot +``` + +弿ºå确认æ¯å¦ç¦ç¨ + +``` +lsmod | grep nouveau # åºè¯¥è¿å空 + +# å¼å§å®è£ +sh NVIDIA-Linux-x86_64-390.87.run +``` + +å®è£ è¿ç¨ä¸ï¼ä¼éå°ä¸äºéé¡¹ï¼ + +``` +Install NVIDIA's 32-bit compatibility libraries (Yes) +centos Install NVIDIA's 32-bit compatibility libraries (Yes) +Would you like to run the nvidia-xconfig utility to automatically update your X configuration file... 
(NO) +``` + + +æåæ¥ç nvidia gpu ç¶æ + +``` +nvidia-smi +``` + +referenceï¼ +https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html + + + +### å®è£ Docker + +``` +yum -y update +yum -y install yum-utils +yum-config-manager --add-repo https://yum.dockerproject.org/repo/main/centos/7 +yum -y update + +# æ¾ç¤º available çå®è£ å +yum search --showduplicates docker-engine + +# å®è£ 1.12.5 çæ¬ docker +yum -y --nogpgcheck install docker-engine-1.12.5* +systemctl start docker + +chown hadoop:netease /var/run/docker.sock +chown hadoop:netease /usr/bin/docker +``` + +Referenceï¼https://docs.docker.com/cs-engine/1.12/ + +### é ç½® Docker + +å¨ `/etc/docker/` ç®å½ä¸ï¼å建`daemon.json`æä»¶, æ·»å 以ä¸é ç½®ï¼åéå¦image_registry_ip, etcd_host_ip, localhost_ip, yarn_dns_registry_host_ip, dns_host_ipéè¦æ ¹æ®å ·ä½ç¯å¢ï¼è¿è¡ä¿®æ¹ + +``` +{ + "insecure-registries": ["${image_registry_ip}:5000"], + "cluster-store":"etcd://${etcd_host_ip1}:2379,${etcd_host_ip2}:2379,${etcd_host_ip3}:2379", + "cluster-advertise":"{localhost_ip}:2375", + "dns": ["${yarn_dns_registry_host_ip}", "${dns_host_ip1}"], + "hosts": ["tcp://{localhost_ip}:2375", "unix:///var/run/docker.sock"] +} +``` + +éå¯ docker daemonï¼ + +``` +sudo systemctl restart docker +``` + + + +### Docker EE version + +```bash +$ docker version + +Client: + Version: 1.12.5 + API version: 1.24 + Go version: go1.6.4 + Git commit: 7392c3b + Built: Fri Dec 16 02:23:59 2016 + OS/Arch: linux/amd64 + +Server: + Version: 1.12.5 + API version: 1.24 + Go version: go1.6.4 + Git commit: 7392c3b + Built: Fri Dec 16 02:23:59 2016 + OS/Arch: linux/amd64 +``` + +### å®è£ nvidia-docker + +Hadoop-3.2 ç submarine 使ç¨çæ¯ 1.0 çæ¬ç nvidia-docker + +``` +wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker-1.0.1-1.x86_64.rpm +sudo rpm -i /tmp/nvidia-docker*.rpm +# å¯å¨ nvidia-docker +sudo systemctl start nvidia-docker + +# æ¥ç nvidia-docker ç¶æï¼ +systemctl status nvidia-docker + +# æ¥ç nvidia-docker æ¥å¿ï¼ 
+journalctl -u nvidia-docker
+
+# Check that nvidia-docker-plugin is working
+curl http://localhost:3476/v1.0/docker/cli
+```
+
+Under `/var/lib/nvidia-docker/volumes/nvidia_driver/`, create directories named after the nvidia-driver version:
+
+```
+mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87
+# 390.87 is the nvidia driver version number
+
+mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
+mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
+
+cp /usr/bin/nvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
+cp /usr/lib64/libcuda* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
+cp /usr/lib64/libnvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
+
+# Test nvidia-smi
+nvidia-docker run --rm nvidia/cuda:9.0-devel nvidia-smi
+```
+
+Test the docker, nvidia-docker and nvidia-driver installation:
+
+```
+# Test 1
+nvidia-docker run --rm nvidia/cuda nvidia-smi
+```
+
+```
+# Test 2
+nvidia-docker run -it tensorflow/tensorflow:1.9.0-gpu bash
+# Inside the container, run:
+python
+import tensorflow as tf
+tf.test.is_gpu_available()
+```
+
+To uninstall nvidia-docker 1.0, see:
+https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
+
+Reference:
+https://github.com/NVIDIA/nvidia-docker/tree/1.0
+
+### Tensorflow Image
+
+CUDNN and CUDA do not actually need to be installed on the physical machine, because Submarine provides images that already contain CUDNN and CUDA. A base Dockerfile can be found in WriteDockerfile.md.
+
+The image above does not support a kerberos environment. If you need kerberos, you can use a Dockerfile like the following:
+
+```shell
+FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
+
+# Pick up some TF dependencies
+RUN apt-get update && apt-get install -y --allow-downgrades --no-install-recommends \
+        build-essential \
+        cuda-command-line-tools-9-0 \
+        cuda-cublas-9-0 \
+        cuda-cufft-9-0 \
+        cuda-curand-9-0 \
+        cuda-cusolver-9-0 \
+        cuda-cusparse-9-0 \
+        curl \
+        libcudnn7=7.0.5.15-1+cuda9.0 \
+        libfreetype6-dev \
+        libpng12-dev \
+        libzmq3-dev \
+        pkg-config \
+        python \
+        python-dev \
+        rsync \
+        software-properties-common \
+        unzip \
+        && \
+        apt-get clean && \
+        rm -rf /var/lib/apt/lists/*
+
+RUN export DEBIAN_FRONTEND=noninteractive && apt-get update && apt-get install -yq krb5-user libpam-krb5 && apt-get clean
+
+RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
+    python get-pip.py && \
+    rm get-pip.py
+
+RUN pip --no-cache-dir install \
+        Pillow \
+        h5py \
+        ipykernel \
+        jupyter \
+        matplotlib \
+        numpy \
+        pandas \
+        scipy \
+        sklearn \
+        && \
+    python -m ipykernel.kernelspec
+
+# Install TensorFlow GPU version.
+RUN pip --no-cache-dir install \
+    http://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.8.0-cp27-none-linux_x86_64.whl
+RUN apt-get update && apt-get install git -y
+
+RUN apt-get update && apt-get install -y openjdk-8-jdk wget
+# Download hadoop-3.1.1.tar.gz
+RUN wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
+RUN tar zxf hadoop-3.1.1.tar.gz
+RUN mv hadoop-3.1.1 hadoop-3.1.0
+
+# Download a kerberos-enabled jdk package
+RUN wget -qO jdk8.tar.gz 'http://${kerberos_jdk_url}/jdk-8u152-linux-x64.tar.gz'
+RUN tar xzf jdk8.tar.gz -C /opt
+RUN mv /opt/jdk* /opt/java
+RUN rm jdk8.tar.gz
+RUN update-alternatives --install /usr/bin/java java /opt/java/bin/java 100
+RUN update-alternatives --install /usr/bin/javac javac /opt/java/bin/javac 100
+
+ENV JAVA_HOME /opt/java
+ENV PATH $PATH:$JAVA_HOME/bin
+```
+
+### Test the TF environment
+
+After building the docker image, manually check that TensorFlow can be used normally inside it before scheduling jobs through YARN, to catch problems early:
+
+```shell
+$ docker run -it ${docker_image_name} /bin/bash
+# >>> inside the container
+$ python
+>>> import tensorflow as tf
+>>> tf.__version__
+```
+
+If something goes wrong, check the following:
+
+1. Check that the environment variables are set correctly:
+
+   ```
+   echo $LD_LIBRARY_PATH
+   /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+   ```
+
+2. Check whether libcuda.so.1 and libcuda.so are in a directory listed in LD_LIBRARY_PATH:
+
+   ```
+   ls -l /usr/local/nvidia/lib64 | grep libcuda.so
+   ```
+
+### Install Etcd
+
+Run the Submarine/install.sh script to install the Etcd component and a service auto-start script on the specified servers.
+
+```shell
+$ ./Submarine/install.sh
+# Check the Etcd service status with:
+systemctl status Etcd.service
+```
+
+Check the Etcd cluster health:
+
+```shell
+$ etcdctl cluster-health
+member 3adf2673436aa824 is healthy: got healthy result from http://${etcd_host_ip1}:2379
+member 85ffe9aafb7745cc is healthy: got healthy result from http://${etcd_host_ip2}:2379
+member b3d05464c356441a is healthy: got healthy result from http://${etcd_host_ip3}:2379
+cluster is healthy
+
+$ etcdctl member list
+3adf2673436aa824: name=etcdnode3 peerURLs=http://${etcd_host_ip1}:2380 clientURLs=http://${etcd_host_ip1}:2379 isLeader=false
+85ffe9aafb7745cc: name=etcdnode2 peerURLs=http://${etcd_host_ip2}:2380 clientURLs=http://${etcd_host_ip2}:2379 isLeader=false
+b3d05464c356441a: name=etcdnode1 peerURLs=http://${etcd_host_ip3}:2380 clientURLs=http://${etcd_host_ip3}:2379 isLeader=true
+```
+
+Here ${etcd_host_ip*} are the IPs of the etcd servers.
+
+### Install Calico
+
+Run the Submarine/install.sh script to install the Calico component and a service auto-start script on the specified servers.
+
+```
+systemctl start calico-node.service
+systemctl status calico-node.service
+```
+
+#### Check the Calico network
+
+```shell
+# Run the following command. Note: it does not show the status of the local server, only that of the other servers
+$ calicoctl node status
+Calico process is running.
+
+IPv4 BGP status
++---------------+-------------------+-------+------------+-------------+
+| PEER ADDRESS  |     PEER TYPE     | STATE |   SINCE    |    INFO     |
++---------------+-------------------+-------+------------+-------------+
+| ${host_ip1}   | node-to-node mesh | up    | 2018-09-21 | Established |
+| ${host_ip2}   | node-to-node mesh | up    | 2018-09-21 | Established |
+| ${host_ip3}   | node-to-node mesh | up    | 2018-09-21 | Established |
++---------------+-------------------+-------+------------+-------------+
+
+IPv6 BGP status
+No IPv6 peers found.
+```
+
+Create docker containers to verify the calico network:
+
+```
+docker network create --driver calico --ipam-driver calico-ipam calico-network
+docker run --net calico-network --name workload-A -tid busybox
+docker run --net calico-network --name workload-B -tid busybox
+docker exec workload-A ping workload-B
+```
+
+## Install Hadoop
+
+### Build Hadoop
+
+```
+mvn package -Pdist -DskipTests -Dtar
+```
+
+### Start the YARN services
+
+```
+YARN_LOGFILE=resourcemanager.log ./sbin/yarn-daemon.sh start resourcemanager
+YARN_LOGFILE=nodemanager.log ./sbin/yarn-daemon.sh start nodemanager
+YARN_LOGFILE=timeline.log ./sbin/yarn-daemon.sh start timelineserver
+YARN_LOGFILE=mr-historyserver.log ./sbin/mr-jobhistory-daemon.sh start historyserver
+```
+
+### Start the registry dns service
+
+```
+sudo YARN_LOGFILE=registrydns.log ./yarn-daemon.sh start registrydns
+```
+
+### Test wordcount
+
+Run the simplest wordcount example to check that YARN is installed correctly:
+
+```
+./bin/hadoop jar /home/hadoop/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar wordcount /tmp/wordcount.txt /tmp/wordcount-output4
+```
+
+## Tensorflow jobs on CPU
+
+### Standalone mode
+
+#### Clean up any existing job with the same name
+
+```bash
+# Run before each submission:
+./bin/yarn app -destroy standalone-tf
+# and delete the hdfs path:
+./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
+# Make sure the previous job has finished
+```
+
+Here the variable ${dfs_name_service} should be replaced with the name of your hdfs name service.
+
+#### Run a standalone-mode tensorflow job
+
+```bash
+./bin/yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
+ --env DOCKER_JAVA_HOME=/opt/java \
+ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name standalone-tf \
+ --docker_image dockerfile-cpu-tf1.8.0-with-models \
+ --input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
+ --checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-checkpoint \
+ --worker_resources memory=4G,vcores=2 --verbose \
+ --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --num-gpus=0"
+```
+
+### Distributed mode
+
+#### Clean up any existing job with the same name
+
+```bash
+# Run before each submission:
+./bin/yarn app -destroy distributed-tf
+# and delete the hdfs path:
+./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
+# Make sure the previous job has finished
+```
+
+#### Submit a distributed-mode tensorflow job
+
+```bash
+./bin/yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
+ --env DOCKER_JAVA_HOME=/opt/java \
+ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf \
+ --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
+ --docker_image dockerfile-cpu-tf1.8.0-with-models \
+ --input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
+ --checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \
+ --worker_resources memory=4G,vcores=2 --verbose \
+ --num_ps 1 \
+ --ps_resources memory=4G,vcores=2 \
+ --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0" \
+ --num_workers 4 \
+ --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data \
--job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=0"
+```
+
+## Tensorflow jobs on GPU
+
+### Add GPU support to the Resourcemanager and Nodemanager
+
+Create resource-types.xml in the yarn configuration directory (conf or etc/hadoop) and add:
+
+```
+<configuration>
+  <property>
+    <name>yarn.resource-types</name>
+    <value>yarn.io/gpu</value>
+  </property>
+</configuration>
+```
+
+### GPU configuration for the Resourcemanager
+
+The scheduler used by the resourcemanager must be the capacity scheduler; set this property in capacity-scheduler.xml:
+
+```
+<configuration>
+  <property>
+    <name>yarn.scheduler.capacity.resource-calculator</name>
+    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
+  </property>
+</configuration>
+```
+
+### GPU configuration for the Nodemanager
+
+Add this configuration to the nodemanager's yarn-site.xml:
+
+```
+<configuration>
+  <property>
+    <name>yarn.nodemanager.resource-plugins</name>
+    <value>yarn.io/gpu</value>
+  </property>
+</configuration>
+```
+
+Add this configuration to container-executor.cfg:
+
+```
+[docker]
+...
+# Add the following to the existing [docker] section:
+# /usr/bin/nvidia-docker is the nvidia-docker path
+# 375.26 in nvidia_driver_375.26 is the nvidia driver version; check it with nvidia-smi
+docker.allowed.volume-drivers=/usr/bin/nvidia-docker
+docker.allowed.devices=/dev/nvidiactl,/dev/nvidia-uvm,/dev/nvidia-uvm-tools,/dev/nvidia1,/dev/nvidia0
+docker.allowed.ro-mounts=nvidia_driver_375.26
+
+[gpu]
+module.enabled=true
+
+[cgroups]
+# /sys/fs/cgroup is the cgroup mount path
+# /hadoop-yarn is the path yarn creates under the cgroup path by default
+root=/sys/fs/cgroup
+yarn-hierarchy=/hadoop-yarn
+```
+
+### Verify with a job submission
+
+Distributed-shell + GPU + cgroup:
+
+```bash
+./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
+ --env DOCKER_JAVA_HOME=/opt/java \
+ --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
+ --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
+ --docker_image gpu-cuda9.0-tf1.8.0-with-models \
+ --input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
+ --checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \
+ --num_ps 0 \
+ --ps_resources memory=4G,vcores=2,gpu=0 \
+ --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0" \
+ --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
+ --num_workers 1 \
+ --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"
+```
+
+## Issues
+
+### Issue 1: the nodemanager fails to start after an operating system reboot
+
+```
+2018-09-20 18:54:39,785 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems!
+org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Unexpected: Cannot create yarn cgroup Subsystem:cpu Mount points:/proc/mounts User:yarn Path:/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn
+  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:425)
+  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:377)
+  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:98)
+  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:87)
+  at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
+  at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)
+  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:389)
+  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
+  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
+  at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
+2018-09-20 18:54:39,789 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED
+```
+
+Solution: as the root user, grant the yarn user permissions on /sys/fs/cgroup/cpu,cpuacct:
+
+```
+chown :yarn -R /sys/fs/cgroup/cpu,cpuacct
+chmod g+rwx -R /sys/fs/cgroup/cpu,cpuacct
+```
+
+When GPU support is enabled, the cgroup devices path needs the same permissions:
+
+```
+chown :yarn -R /sys/fs/cgroup/devices
+chmod g+rwx -R /sys/fs/cgroup/devices
+```
+
+### Issue 2: container-executor permissions
+
+```
+2018-09-21 09:36:26,102 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: IOException executing command:
+java.io.IOException: Cannot run program "/etc/yarn/sbin/Linux-amd64-64/container-executor": error=13, Permission denied
+  at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
+  at org.apache.hadoop.util.Shell.runCommand(Shell.java:938)
+  at org.apache.hadoop.util.Shell.run(Shell.java:901)
+  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
+```
+
+The permissions of /etc/yarn/sbin/Linux-amd64-64/container-executor should be 6050.
+
+### Issue 3: checking the startup logs of a system service
+
+```
+journalctl -u docker
+```
+
+### Issue 4: Docker cannot remove a container: device or resource busy
+
+```bash
+$ docker rm 0bfafa146431
+Error response from daemon: Unable to remove filesystem for 0bfafa146431771f6024dcb9775ef47f170edb2f1852f71916ba44209ca6120a: remove /app/docker/containers/0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a/shm: device or resource busy
+```
+
+Write a find-busy-mnt.sh script to find the mounts held by processes that keep a container in the device or resource busy state:
+
+```bash
+#!/bin/bash
+
+# A simple script to get information about mount points and pids and their
+# mount namespaces.
+
+if [ $# -ne 1 ]; then
+  echo "Usage: $0 <devicemapper-device-id>"
+  exit 1
+fi
+
+ID=$1
+
+MOUNTS=`find /proc/*/mounts | xargs grep $ID 2>/dev/null`
+
+[ -z "$MOUNTS" ] && echo "No pids found" && exit 0
+
+printf "PID\tNAME\t\tMNTNS\n"
+echo "$MOUNTS" | while read LINE; do
+  PID=`echo $LINE | cut -d ":" -f1 | cut -d "/" -f3`
+  # Ignore self and thread-self
+  if [ "$PID" == "self" ] || [ "$PID" == "thread-self" ]; then
+    continue
+  fi
+  NAME=`ps -q $PID -o comm=`
+  MNTNS=`readlink /proc/$PID/ns/mnt`
+  printf "%s\t%s\t\t%s\n" "$PID" "$NAME" "$MNTNS"
+done
+```
+
+Find the process occupying the directory and kill it:
+
+```bash
+$ chmod +x find-busy-mnt.sh
+$ ./find-busy-mnt.sh 0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a
+# PID   NAME            MNTNS
+# 5007  ntpd            mnt:[4026533598]
+$ kill -9 5007
+```
+
+### Issue 5: sudo nvidia-docker run reports an error
+
+```
+docker: Error response from daemon: create nvidia_driver_361.42: VolumeDriver.Create: internal error, check logs for details.
+See 'docker run --help'.
+```
+
+Solution:
+
+```
+# Check whether nvidia-docker started correctly:
+$ systemctl status nvidia-docker
+$ journalctl -n -u nvidia-docker
+# Restart nvidia-docker:
+systemctl stop nvidia-docker
+systemctl start nvidia-docker
+```
+
+### Issue 6: YARN fails to launch containers
+
+If the number of containers you request is too large (PS + workers > total number of GPU cards), container creation can fail, because more GPU containers were placed on a single server than that machine has GPU cards.
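The arithmetic behind this failure mode can be sketched as a pre-submission sanity check. This is a hypothetical helper, not part of Submarine; the parameter names mirror the `--num_ps`/`--num_workers`/`gpu=` flags used in the job commands above, and for simplicity it treats a single server's GPU cards as the whole pool:

```python
def can_schedule(num_ps, ps_gpus, num_workers, worker_gpus, total_gpus):
    """True if the job's total GPU demand fits within the GPUs available."""
    demand = num_ps * ps_gpus + num_workers * worker_gpus
    return demand <= total_gpus

# The distributed-tf-gpu example above asks for 0 PS GPUs and 1 worker with 1 GPU,
# so it fits on a machine with a single GPU card:
print(can_schedule(num_ps=0, ps_gpus=0, num_workers=1, worker_gpus=1, total_gpus=1))  # True
# 4 workers with 1 GPU each cannot all land on a 2-GPU machine:
print(can_schedule(num_ps=1, ps_gpus=0, num_workers=4, worker_gpus=1, total_gpus=2))  # False
```

If the check fails, either reduce `--num_workers`/the per-container `gpu=` request or spread the job across more GPU nodes.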
