jiwq commented on a change in pull request #106: SUBMARINE-289.Add the 
distributed applications case about Kaldi on Submarine
URL: https://github.com/apache/submarine/pull/106#discussion_r350557859
 
 

 ##########
 File path: docs/ecosystem/kaldi/RunningDistributedThchs30KaldiJobs.md
 ##########
 @@ -0,0 +1,678 @@
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+   http://www.apache.org/licenses/LICENSE-2.0
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+# Thchs30 Kaldi Example With YARN Service
+
+## Prepare data for training
+
+Thchs30 is a common speech benchmark in machine learning that provides speech data and transcripts. The example below is based on the Thchs30 dataset.
+
+1) Download the tgz files:
+```
+THCHS30_PATH=/data/hdfs1/nfs/aisearch/kaldi/thchs30
+mkdir -p $THCHS30_PATH/data && cd $THCHS30_PATH/data
+wget http://www.openslr.org/resources/18/data_thchs30.tgz
+wget http://www.openslr.org/resources/18/test-noise.tgz
+wget http://www.openslr.org/resources/18/resource.tgz
+```
+
+2) Clone https://github.com/apache/submarine.git:
+```
+git clone https://github.com/apache/submarine.git
+```
+
+3) Go to `submarine/docker/ecosystem/` and copy the `sge` directory:
+```
+cp -r ./kaldi/sge $THCHS30_PATH/sge
+```
+
+4) Optional: modify `/opt/kaldi/egs/thchs30/s5/cmd.sh` in the container. The following queue is used by default:
+```
+export train_cmd="queue.pl -q all.q"
+```
+
+**Warning:**
+
+Please note that YARN service doesn't allow multiple services with the same name, so run the following command
+```
+yarn application -destroy <service-name>
+```
+to delete services if you want to reuse the same service name.
+
+## Prepare Docker images
+
+Refer to [Write Dockerfile](WriteDockerfileKaldi.md) to build a Docker image, or use the prebuilt one:
+
+- hadoopsubmarine/kaldi-latest-gpu-base:0.0.1
+
+## Run Kaldi jobs
+
+### Run distributed training
+
+```
+# Change the variables according to your needs
+SUBMARINE_VERSION=3.3.0-SNAPSHOT
+WORKER_NUM=2
+SGE_CFG_PATH=/cfg
+THCHS30_PATH=/data/hdfs1/nfs/aisearch/kaldi/thchs30
+DOCKER_HADOOP_HDFS_HOME=/app/${SUBMARINE_VERSION}
+
+# This job depends on RegistryDNS: you must fill in <your RegistryDNS IP> in resolv.conf
+yarn jar /usr/local/matrix/share/hadoop/yarn/${SUBMARINE_VERSION}.jar \
+job run --name kaldi-thchs30-distributed \
+--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
+--env DOCKER_HADOOP_HDFS_HOME=$DOCKER_HADOOP_HDFS_HOME \
+--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
+--env PYTHONUNBUFFERED="0" \
+--env TZ="Asia/Shanghai" \
+--env YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=${THCHS30_PATH}/sge/resolv.conf:/etc/resolv.conf,\
+${THCHS30_PATH}/sge/passwd:/etc/passwd:rw,\
+${THCHS30_PATH}/sge/group:/etc/group:rw,\
+${THCHS30_PATH}/sge:$SGE_CFG_PATH,\
+${THCHS30_PATH}/data:/opt/kaldi/egs/thchs30,\
+${THCHS30_PATH}/mul/s5:/opt/kaldi/egs/mul-thchs30/s5 \
+--input_path /opt/kaldi/egs/thchs30/data \
+--docker_image hadoopsubmarine/kaldi-latest-gpu-base:0.0.1 \
+--num_workers $WORKER_NUM \
+--worker_resources memory=64G,vcores=32,gpu=1 \
+--worker_launch_cmd "sudo mkdir -p /opt/kaldi/egs/mul-thchs30/s5 && \
+sudo cp /opt/kaldi/egs/thchs30/s5/* /opt/kaldi/egs/mul-thchs30/s5 -r && \
+cluster_user=`whoami` domain_suffix="ml.com" && \
+cd /cfg && bash sge_run.sh $WORKER_NUM $SGE_CFG_PATH && \
+if [ $(echo $HOST_NAME |grep "^master-") ]; then sleep 2m && cd /opt/kaldi/egs/mul-thchs30/s5 && ./run.sh; fi" \
+--verbose
+```
+
+Explanations:
+
+- A `num_workers` value greater than 1 indicates distributed training.
+- Parameters, resources, and the Docker image of the parameter server can be specified separately. In many cases the parameter server doesn't require a GPU. This example doesn't need a parameter server.
+
+For the meaning of the individual parameters, see the [QuickStart](../../helper/QuickStart.md) page.
+
+*Outputs of distributed training*
+
+Sample output of master:
+```
+...
+Reading package lists...
+Building dependency tree...
+Reading state information...
+The following additional packages will be installed:
+  bsd-mailx cpio gridengine-common ifupdown iproute2 isc-dhcp-client
+  isc-dhcp-common libatm1 libdns-export162 libisc-export160 liblockfile-bin
+  liblockfile1 libmnl0 libxmuu1 libxtables11 ncurses-term netbase
+  openssh-client openssh-server openssh-sftp-server postfix python3-chardet
+  python3-pkg-resources python3-requests python3-six python3-urllib3
+  ssh-import-id ssl-cert tcsh xauth
+Suggested packages:
+  libarchive1 gridengine-qmon ppp rdnssd iproute2-doc resolvconf avahi-autoipd
+  isc-dhcp-client-ddns apparmor ssh-askpass libpam-ssh keychain monkeysphere
+  rssh molly-guard ufw procmail postfix-mysql postfix-pgsql postfix-ldap
+  postfix-pcre sasl2-bin libsasl2-modules dovecot-common postfix-cdb
+  postfix-doc python3-setuptools python3-ndg-httpsclient python3-openssl
+  python3-pyasn1 openssl-blacklist
+The following NEW packages will be installed:
+  bsd-mailx cpio gridengine-client gridengine-common gridengine-exec
+  gridengine-master ifupdown iproute2 isc-dhcp-client isc-dhcp-common libatm1
+  libdns-export162 libisc-export160 liblockfile-bin liblockfile1 libmnl0
+  libxmuu1 libxtables11 ncurses-term netbase openssh-client openssh-server
+  openssh-sftp-server postfix python3-chardet python3-pkg-resources
+  python3-requests python3-six python3-urllib3 ssh-import-id ssl-cert tcsh
+  xauth
+0 upgraded, 33 newly installed, 0 to remove and 30 not upgraded.
+Need to get 12.1 MB of archives.
+After this operation, 65.8 MB of additional disk space will be used.
+Get:1 http://mirrors.aliyun.com/ubuntu xenial/main amd64 libatm1 amd64 1:2.5.1-1.5 [24.2 kB]
 
 Review comment:
   Can we replace this source?
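
One way the mirror could be swapped is sketched below. It is only an illustration under assumptions: the image is assumed to keep its apt configuration in `/etc/apt/sources.list`, `archive.ubuntu.com` is assumed to be an acceptable replacement for `mirrors.aliyun.com`, and the function name `replace_apt_mirror` is hypothetical and not part of this PR.

```shell
# Hypothetical helper (not part of the PR): rewrite the apt mirror host
# in a sources.list-style file.
#   $1  file to edit (assumed to be /etc/apt/sources.list in the image)
#   $2  mirror host to replace (mirrors.aliyun.com in the log above)
#   $3  replacement host (archive.ubuntu.com is an assumption)
replace_apt_mirror() {
  local file="$1"
  local old_host="$2"
  local new_host="$3"
  local pattern tmp

  # Escape dots so the old host is matched literally, not as a regex.
  pattern="$(printf '%s' "$old_host" | sed 's/\./\\./g')"

  # Write to a temporary file first, then copy back, so the original
  # file keeps its permissions and is not left half-written.
  tmp="$(mktemp)"
  sed "s|//${pattern}/|//${new_host}/|g" "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

If the Dockerfile sets the mirror before `apt-get update`, a call such as `replace_apt_mirror /etc/apt/sources.list mirrors.aliyun.com archive.ubuntu.com` at that point would be one option. Dropping the mirror override entirely may be simpler if the base image already uses the default Ubuntu archive.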
