[yunikorn-site] branch master updated: [YUNIKORN-1355] Generic example of GPU scheduling with Yunikorn (#200)

yuchaoran Wed, 30 Nov 2022 22:30:53 -0800

This is an automated email from the ASF dual-hosted git repository.

yuchaoran pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-site.git



The following commit(s) were added to refs/heads/master by this push:
     new 9d5c98828 [YUNIKORN-1355] Generic example of GPU scheduling with 
Yunikorn (#200)
9d5c98828 is described below

commit 9d5c988281c25ff6ea755b3552e28a5d497880d0
Author: KatLantyss <[email protected]>
AuthorDate: Thu Dec 1 14:30:42 2022 +0800

    [YUNIKORN-1355] Generic example of GPU scheduling with Yunikorn (#200)
---
 docs/assets/yunikorn-gpu-time-slicing.png      | Bin 0 -> 40653 bytes
 docs/user_guide/workloads/run_nvidia.md        | 346 +++++++++++++++++++++++++
 docs/user_guide/workloads/run_tensorflow.md    | 244 +++++++++--------
 docs/user_guide/workloads/workload_overview.md |   1 +
 sidebars.js                                    |   1 +
 5 files changed, 468 insertions(+), 124 deletions(-)

diff --git a/docs/assets/yunikorn-gpu-time-slicing.png 
b/docs/assets/yunikorn-gpu-time-slicing.png
new file mode 100644
index 000000000..8b3d734a4
Binary files /dev/null and b/docs/assets/yunikorn-gpu-time-slicing.png differ
diff --git a/docs/user_guide/workloads/run_nvidia.md 
b/docs/user_guide/workloads/run_nvidia.md
new file mode 100644
index 000000000..644910851
--- /dev/null
+++ b/docs/user_guide/workloads/run_nvidia.md
@@ -0,0 +1,346 @@
+---
+id: run_nvidia
+title: Run NVIDIA GPU Jobs
+description: How to run a generic example of GPU scheduling with Yunikorn.
+keywords:
+ - NVIDIA GPU
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Yunikorn with NVIDIA GPUs
+This guide gives an overview of how to set up the NVIDIA Device Plugin, which enables users to run GPU workloads with Yunikorn. For more details, see [**Kubernetes with GPUs**](https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#option-2-installing-kubernetes-using-kubeadm).
+
+### Prerequisite
+Before following the steps below, Yunikorn must be deployed on [**Kubernetes with GPUs**](https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#install-kubernetes).
+
+### Install NVIDIA Device Plugin
+Add the nvidia-device-plugin helm repository.
+```
+helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
+helm repo update
+helm repo list
+```
+
+Verify the latest release version of the plugin is available.
+```
+helm search repo nvdp --devel
+NAME                             CHART VERSION  APP VERSION    DESCRIPTION
+nvdp/nvidia-device-plugin        0.12.3         0.12.3         A Helm chart for ...
+```
+
+Deploy the device plugin.
+```
+kubectl create namespace nvidia
+helm install --generate-name nvdp/nvidia-device-plugin --namespace nvidia --version 0.12.3
+```
+
+Check the status of the pods to ensure the NVIDIA device plugin is running.
+```
+kubectl get pods -A
+
+NAMESPACE      NAME                                      READY   STATUS    RESTARTS      AGE
+kube-flannel   kube-flannel-ds-j24fx                     1/1     Running   1 (11h ago)   11h
+kube-system    coredns-78fcd69978-2x9l8                  1/1     Running   1 (11h ago)   11h
+kube-system    coredns-78fcd69978-gszrw                  1/1     Running   1 (11h ago)   11h
+kube-system    etcd-katlantyss-nzxt                      1/1     Running   3 (11h ago)   11h
+kube-system    kube-apiserver-katlantyss-nzxt            1/1     Running   4 (11h ago)   11h
+kube-system    kube-controller-manager-katlantyss-nzxt   1/1     Running   3 (11h ago)   11h
+kube-system    kube-proxy-4wz7r                          1/1     Running   1 (11h ago)   11h
+kube-system    kube-scheduler-katlantyss-nzxt            1/1     Running   4 (11h ago)   11h
+kube-system    nvidia-device-plugin-1659451060-c92sb     1/1     Running   1 (11h ago)   11h
+```
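The status check above can also be scripted. The following is a minimal sketch, assuming the six-column layout that `kubectl get pods -A` prints above (namespace, name, ready, status, restarts, age); `device_plugin_running` is a hypothetical helper name, not part of the plugin or of Yunikorn.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and
# succeeds (exit 0) if at least one pod whose name starts with
# "nvidia-device-plugin" reports STATUS "Running"; otherwise it fails (exit 1).
device_plugin_running() {
  awk '$2 ~ /^nvidia-device-plugin/ && $4 == "Running" { found = 1 }
       END { exit !found }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pods -A | device_plugin_running && echo "device plugin is running"
```

Matching on the name prefix rather than the full pod name keeps the check independent of the suffix that `--generate-name` produces.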
+
+### Testing NVIDIA Device Plugin
+Create a GPU test YAML file.
+```
+# gpu-pod.yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-operator-test
+spec:
+  restartPolicy: OnFailure
+  containers:
+  - name: cuda-vector-add
+    image: "nvidia/samples:vectoradd-cuda10.2"
+    resources:
+      limits:
+        nvidia.com/gpu: 1
+```
+Deploy the application.
+```
+kubectl apply -f gpu-pod.yaml
+```
+Check the pod status to ensure the app completed successfully.
+```
+kubectl get pods gpu-operator-test
+
+NAME                READY   STATUS      RESTARTS   AGE
+gpu-operator-test   0/1     Completed   0          9d
+```
+Check the result.
+```
+kubectl logs gpu-operator-test
+       
+[Vector addition of 50000 elements]
+Copy input data from the host memory to the CUDA device
+CUDA kernel launch with 196 blocks of 256 threads
+Copy output data from the CUDA device to the host memory
+Test PASSED
+Done
+```
+
+---
+## Enable GPU Time-Slicing (Optional)
+GPU time-slicing allows multiple tenants to share a single GPU.
+To learn how GPU time-slicing works, refer to [**Time-Slicing GPUs in Kubernetes**](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/gpu-sharing.html#introduction). This page covers how to enable GPU scheduling in Yunikorn using the [**NVIDIA GPU Operator**](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/gpu-operator).
+
+
+### Configuration
+Specify multiple configurations in a `ConfigMap` as in the following example.
+```yaml
+# time-slicing-config.yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: time-slicing-config
+  namespace: nvidia
+data:
+    a100-40gb: |-
+        version: v1
+        sharing:
+          timeSlicing:
+            resources:
+            - name: nvidia.com/gpu
+              replicas: 8
+            - name: nvidia.com/mig-1g.5gb
+              replicas: 2
+            - name: nvidia.com/mig-2g.10gb
+              replicas: 2
+            - name: nvidia.com/mig-3g.20gb
+              replicas: 3
+            - name: nvidia.com/mig-7g.40gb
+              replicas: 7
+    rtx-3070: |-
+        version: v1
+        sharing:
+          timeSlicing:
+            resources:
+            - name: nvidia.com/gpu
+              replicas: 8
+```
+
+:::note
+If the nodes do not have a100-40gb or rtx-3070 GPUs, modify the YAML file to match the GPU types that are present.
+For example, suppose the local Kubernetes cluster only has several rtx-2080ti GPUs.
+The rtx-2080ti does not support MIG, so it cannot take the place of the a100-40gb entry.
+The rtx-2080ti does support time-slicing, so it can take the place of the rtx-3070 entry.
+:::
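To make the note above concrete, here is a hedged sketch of what a configuration for a cluster with only rtx-2080ti GPUs might look like. It reuses the structure of `time-slicing-config.yaml` and drops the MIG resources. The helper name `write_2080ti_config` and the replica count of 4 are illustrative assumptions, not values from the original guide.

```shell
# Hypothetical helper: writes a time-slicing ConfigMap with a single
# "rtx-2080ti" entry to the path given as $1.
# The replica count (4) is an illustrative assumption, not a recommendation.
write_2080ti_config() {
  cat > "$1" <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia
data:
    rtx-2080ti: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
            - name: nvidia.com/gpu
              replicas: 4
EOF
}

# Usage (requires kubectl access to the cluster):
#   write_2080ti_config time-slicing-config.yaml
#   kubectl create -f time-slicing-config.yaml
```

With a file like this, the configuration name used in later steps of this guide would be `rtx-2080ti` rather than `rtx-3070`.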
+
+:::info
+MIG support was added to Kubernetes in 2020. Refer to [**Supporting MIG in Kubernetes**](https://docs.google.com/document/d/1mdgMQ8g7WmaI_XVVRrCvHPFPOMCm5LQD5JefgAh6N8g/edit) for details on how this works.
+:::
+
+Create a `ConfigMap` in the operator namespace. 
+```bash
+kubectl create namespace nvidia
+kubectl create -f time-slicing-config.yaml
+```
+
+### Install NVIDIA GPU Operator
+Add the nvidia-gpu-operator helm repository.
+```bash
+helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
+helm repo update
+helm repo list
+```
+
+Enable shared access to GPUs with the NVIDIA GPU Operator.
+- For a fresh install of the NVIDIA GPU Operator with time-slicing enabled:
+  ```bash
+  helm install gpu-operator nvidia/gpu-operator \
+      -n nvidia \
+      --set devicePlugin.config.name=time-slicing-config
+  ```
+
+- To enable time-slicing dynamically when the GPU Operator is already installed:
+  ```bash
+  kubectl patch clusterpolicy/cluster-policy \
+  -n nvidia --type merge \
+  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config"}}}}'
+  ```
+
+### Applying the Time-Slicing Configuration
+There are two methods:
+- Across the cluster
+
+  Patch the cluster policy with the time-slicing `ConfigMap` name and the default configuration.
+  ```bash
+  kubectl patch clusterpolicy/cluster-policy \
+    -n nvidia --type merge \
+    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "rtx-3070"}}}}'
+  ```
+
+- On certain nodes
+
+  Label the node with the required time-slicing configuration in the 
`ConfigMap`.
+  ```bash
+  kubectl label node <node-name> nvidia.com/device-plugin.config=rtx-3070
+  ```
+
+Once the GPU Operator is installed and time-slicing is configured, check the status of the pods to ensure all the containers are running and the validation is complete.
+```bash
+kubectl get pods -n nvidia
+```
+
+```bash
+NAME                                                          READY   STATUS      RESTARTS   AGE
+gpu-feature-discovery-qbslx                                   2/2     Running     0          20h
+gpu-operator-7bdd8bf555-7clgv                                 1/1     Running     0          20h
+gpu-operator-node-feature-discovery-master-59b4b67f4f-q84zn   1/1     Running     0          20h
+gpu-operator-node-feature-discovery-worker-n58dv              1/1     Running     0          20h
+nvidia-container-toolkit-daemonset-8gv44                      1/1     Running     0          20h
+nvidia-cuda-validator-tstpk                                   0/1     Completed   0          20h
+nvidia-dcgm-exporter-pgk7v                                    1/1     Running     1          20h
+nvidia-device-plugin-daemonset-w8hh4                          2/2     Running     0          20h
+nvidia-device-plugin-validator-qrpxx                          0/1     Completed   0          20h
+nvidia-operator-validator-htp6b                               1/1     Running     0          20h
+```
+Verify that the time-slicing configuration is applied successfully.
+```bash
+kubectl describe node <node-name>
+```
+
+```bash
+...
+Capacity:
+  nvidia.com/gpu: 8
+...
+Allocatable:
+  nvidia.com/gpu: 8
+...
+```
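The capacity check above can be scripted as well. This is a minimal sketch, assuming the `kubectl describe node` layout shown above: a `Capacity:` header at the start of a line, followed by indented `resource: value` lines. `gpu_capacity` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl describe node <node-name>` output on
# stdin and prints the nvidia.com/gpu value from the "Capacity:" section.
# Prints nothing if the node advertises no nvidia.com/gpu resource.
gpu_capacity() {
  awk '/^Capacity:/ { in_cap = 1; next }   # entering the Capacity section
       /^[^ ]/      { in_cap = 0 }         # any non-indented line ends it
       in_cap && $1 == "nvidia.com/gpu:" { print $2; exit }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl describe node <node-name> | gpu_capacity
```

For the `rtx-3070` configuration above (`replicas: 8`), the output shown in this guide reports 8, which is presumably what a node with a single physical GPU would print here.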
+
+### Testing GPU Time-Slicing
+Create a workload test file `plugin-test.yaml`.
+```yaml
+# plugin-test.yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: nvidia-plugin-test
+  labels:
+    app: nvidia-plugin-test
+spec:
+  replicas: 5
+  selector:
+    matchLabels:
+      app: nvidia-plugin-test
+  template:
+    metadata:
+      labels:
+        app: nvidia-plugin-test
+    spec:
+      tolerations:
+        - key: nvidia.com/gpu
+          operator: Exists
+          effect: NoSchedule
+      containers:
+        - name: dcgmproftester11
+          image: nvidia/samples:dcgmproftester-2.1.7-cuda11.2.2-ubuntu20.04
+          command: ["/bin/sh", "-c"]
+          args:
+            - while true; do /usr/bin/dcgmproftester11 --no-dcgm-validation -t 1004 -d 300; sleep 30; done
+          resources:
+            limits:
+              nvidia.com/gpu: 1
+          securityContext:
+            capabilities:
+              add: ["SYS_ADMIN"]
+```
+
+Create a deployment with multiple replicas.
+```bash
+kubectl apply -f plugin-test.yaml
+```
+
+Verify that all five replicas are running.
+- In pods
+  ```bash
+  kubectl get pods
+  ```
+
+  ```bash
+  NAME                                  READY   STATUS    RESTARTS   AGE
+  nvidia-plugin-test-677775d6c5-bpsvn   1/1     Running   0          8m8s
+  nvidia-plugin-test-677775d6c5-m95zm   1/1     Running   0          8m8s
+  nvidia-plugin-test-677775d6c5-9kgzg   1/1     Running   0          8m8s
+  nvidia-plugin-test-677775d6c5-lrl2c   1/1     Running   0          8m8s
+  nvidia-plugin-test-677775d6c5-9r2pz   1/1     Running   0          8m8s
+  ```
+- In node
+  ```bash
+  kubectl describe node <node-name>
+  ```
+
+  ```bash
+  ...
+  Allocated resources:
+    (Total limits may be over 100 percent, i.e., overcommitted.)
+    Resource           Requests    Limits
+    --------           --------    ------
+    ...
+    nvidia.com/gpu     5           5
+  ...
+  ```
+- In the NVIDIA System Management Interface
+  ```bash
+  nvidia-smi
+  ```
+
+  ```bash
+  +-----------------------------------------------------------------------------+
+  | NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
+  |-------------------------------+----------------------+----------------------+
+  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
+  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
+  |                               |                      |               MIG M. |
+  |===============================+======================+======================|
+  |   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
+  | 46%   86C    P2   214W / 220W |   4297MiB /  8192MiB |    100%      Default |
+  |                               |                      |                  N/A |
+  +-------------------------------+----------------------+----------------------+
+
+  +-----------------------------------------------------------------------------+
+  | Processes:                                                                  |
+  |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
+  |        ID   ID                                                   Usage      |
+  |=============================================================================|
+  |    0   N/A  N/A   1776886      C   /usr/bin/dcgmproftester11         764MiB |
+  |    0   N/A  N/A   1776921      C   /usr/bin/dcgmproftester11         764MiB |
+  |    0   N/A  N/A   1776937      C   /usr/bin/dcgmproftester11         764MiB |
+  |    0   N/A  N/A   1777068      C   /usr/bin/dcgmproftester11         764MiB |
+  |    0   N/A  N/A   1777079      C   /usr/bin/dcgmproftester11         764MiB |
+  +-----------------------------------------------------------------------------+
+  ```
+
+- In Yunikorn UI applications
+![](../../assets/yunikorn-gpu-time-slicing.png)
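The replica check in the list above can be scripted. The sketch below assumes the five-column `kubectl get pods` layout shown earlier and the `nvidia-plugin-test` deployment name from `plugin-test.yaml`; `running_replicas` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# how many pods named nvidia-plugin-test-* report STATUS "Running".
running_replicas() {
  awk '$1 ~ /^nvidia-plugin-test-/ && $3 == "Running" { n++ }
       END { print n + 0 }'   # n + 0 prints 0 when no pod matched
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pods | running_replicas
```

With the deployment above, the expected value is 5, one per replica sharing the time-sliced GPU.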
diff --git a/docs/user_guide/workloads/run_tensorflow.md 
b/docs/user_guide/workloads/run_tensorflow.md
index 152068bd1..c1375759f 100644
--- a/docs/user_guide/workloads/run_tensorflow.md
+++ b/docs/user_guide/workloads/run_tensorflow.md
@@ -92,141 +92,137 @@ please read the document 
[here](../../get_started/get_started.md#access-the-web-
 
 ![tf-job-on-ui](../../assets/tf-job-on-ui.png)
 
-## Using Time-Slicing GPU
-
-### Prerequisite
-To use Time-Slicing GPU your cluster must be configured to use GPUs and 
Time-Slicing GPUs.
-- Nodes must have GPUs attached.
-- Kubernetes version 1.24
-- GPU drivers must be installed on the cluster
-- Use the [GPU 
Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html)
 to automatically setup and manage the NVIDA software components on the worker 
nodes.
-- Set the Configuration of [Time-Slicing GPUs in 
Kubernetes](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/gpu-sharing.html)
-
-
-
-Once the GPU Operator and Time-Slicing GPUs is installed, check the status of 
the pods to ensure all the containers are running and the validation is 
complete :
-```shell script
-kubectl get pod -n gpu-operator
-```
-```shell script
-NAME                                                          READY   STATUS   
   RESTARTS       AGE
-gpu-feature-discovery-fd5x4                                   2/2     Running  
   0              5d2h
-gpu-operator-569d9c8cb-kbn7s                                  1/1     Running  
   14 (39h ago)   5d2h
-gpu-operator-node-feature-discovery-master-84c7c7c6cf-f4sxz   1/1     Running  
   0              5d2h
-gpu-operator-node-feature-discovery-worker-p5plv              1/1     Running  
   8 (39h ago)    5d2h
-nvidia-container-toolkit-daemonset-zq766                      1/1     Running  
   0              5d2h
-nvidia-cuda-validator-5tldf                                   0/1     
Completed   0              5d2h
-nvidia-dcgm-exporter-95vm8                                    1/1     Running  
   0              5d2h
-nvidia-device-plugin-daemonset-7nzvf                          2/2     Running  
   0              5d2h
-nvidia-device-plugin-validator-gj7nn                          0/1     
Completed   0              5d2h
-nvidia-operator-validator-nz84d                               1/1     Running  
   0              5d2h
-```
-Verify that the time-slicing configuration is applied successfully :
+## Run a TensorFlow job with GPU scheduling
+To use time-slicing GPUs, your cluster must be configured to use [GPUs and Time-Slicing GPUs](https://yunikorn.apache.org/docs/next/user_guide/workloads/run_nvidia).
+This section covers a workload test scenario to validate a TFJob with time-slicing GPUs.
 
-```shell script
+:::note
+Verify that the time-slicing configuration is applied successfully.
+```bash
 kubectl describe node
 ```
 
-```shell script
+```bash
 Capacity:
-  nvidia.com/gpu:     16
+  nvidia.com/gpu:     8
 ...
 Allocatable:
-  nvidia.com/gpu:     16
+  nvidia.com/gpu:     8
 ...
 ```
-### Testing TensorFlow job with GPUs
-This section covers a workload test scenario to validate TFJob with 
Time-slicing GPU.
+:::
 
-1. Create a workload test file `tf-gpu.yaml` as follows:
-  ```shell script
-  vim tf-gpu.yaml
+Create a workload test file `tf-gpu.yaml`.
+```yaml
+# tf-gpu.yaml
+apiVersion: "kubeflow.org/v1"
+kind: "TFJob"
+metadata:
+  name: "tf-smoke-gpu"
+  namespace: kubeflow
+spec:
+  tfReplicaSpecs:
+    PS:
+      replicas: 1
+      template:
+        metadata:
+          creationTimestamp: 
+          labels:
+            applicationId: "tf_job_20200521_001"
+        spec:
+          schedulerName: yunikorn
+          containers:
+            - args:
+                - python
+                - tf_cnn_benchmarks.py
+                - --batch_size=32
+                - --model=resnet50
+                - --variable_update=parameter_server
+                - --flush_stdout=true
+                - --num_gpus=1
+                - --local_parameter_device=cpu
+                - --device=cpu
+                - --data_format=NHWC
+              image: docker.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
+              name: tensorflow
+              ports:
+                - containerPort: 2222
+                  name: tfjob-port
+              workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
+          restartPolicy: OnFailure
+    Worker:
+      replicas: 1
+      template:
+        metadata:
+          creationTimestamp: null
+          labels:
+            applicationId: "tf_job_20200521_001"
+        spec:
+          schedulerName: yunikorn
+          containers:
+            - args:
+                - python
+                - tf_cnn_benchmarks.py
+                - --batch_size=32
+                - --model=resnet50
+                - --variable_update=parameter_server
+                - --flush_stdout=true
+                - --num_gpus=1
+                - --local_parameter_device=cpu
+                - --device=gpu
+                - --data_format=NHWC
+              image: docker.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
+              name: tensorflow
+              ports:
+                - containerPort: 2222
+                  name: tfjob-port
+              resources:
+                limits:
+                  nvidia.com/gpu: 2
+              workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
+          restartPolicy: OnFailure
+```
+Create the TFJob.
+```bash
+kubectl apply -f tf-gpu.yaml
+kubectl get pods -n kubeflow
+```
+```bash
+NAME                                 READY   STATUS    RESTARTS   AGE
+tf-smoke-gpu-ps-0                    1/1     Running   0          18m
+tf-smoke-gpu-worker-0                1/1     Running   0          18m
+training-operator-7d98f9dd88-dd45l   1/1     Running   0          19m
+```
+
+Verify that the TFJob is running.
+- In pod logs
+  ```bash
+  kubectl logs tf-smoke-gpu-worker-0 -n kubeflow
   ```
-  ```yaml
-  apiVersion: "kubeflow.org/v1"
-  kind: "TFJob"
-  metadata:
-    name: "tf-smoke-gpu"
-    namespace: kubeflow
-  spec:
-    tfReplicaSpecs:
-      PS:
-        replicas: 1
-        template:
-          metadata:
-            creationTimestamp: 
-            labels:
-              applicationId: "tf_job_20200521_001"
-          spec:
-            schedulerName: yunikorn
-            containers:
-              - args:
-                  - python
-                  - tf_cnn_benchmarks.py
-                  - --batch_size=32
-                  - --model=resnet50
-                  - --variable_update=parameter_server
-                  - --flush_stdout=true
-                  - --num_gpus=1
-                  - --local_parameter_device=cpu
-                  - --device=cpu
-                  - --data_format=NHWC
-                image: 
docker.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
-                name: tensorflow
-                ports:
-                  - containerPort: 2222
-                    name: tfjob-port
-                workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
-            restartPolicy: OnFailure
-      Worker:
-        replicas: 1
-        template:
-          metadata:
-            creationTimestamp: null
-            labels:
-              applicationId: "tf_job_20200521_001"
-          spec:
-            schedulerName: yunikorn
-            containers:
-              - args:
-                  - python
-                  - tf_cnn_benchmarks.py
-                  - --batch_size=32
-                  - --model=resnet50
-                  - --variable_update=parameter_server
-                  - --flush_stdout=true
-                  - --num_gpus=1
-                  - --local_parameter_device=cpu
-                  - --device=gpu
-                  - --data_format=NHWC
-                image: 
docker.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
-                name: tensorflow
-                ports:
-                  - containerPort: 2222
-                    name: tfjob-port
-                resources:
-                  limits:
-                    nvidia.com/gpu: 2
-                workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
-            restartPolicy: OnFailure
   ```
-2. Create the TFJob
-  ```shell script
-  kubectl apply -f tf-gpu.yaml
+  .......
+  ..Found device 0 with properties
+  ..name: NVIDIA GeForce RTX 3080 major: 8 minor: 6 memoryClockRate(GHz): 1.71
+
+  .......
+  ..Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:01:00.0, compute capability: 8.6)
+  .......
+  ```
+
+- In node
+  ```bash
+  ...
+  Allocated resources:
+    (Total limits may be over 100 percent, i.e., overcommitted.)
+    Resource           Requests     Limits
+    --------           --------     ------
+    ...
+    nvidia.com/gpu     2            2
+  ...
   ```
-3. Verify that TFJob are running on YuniKorn:
+
+- In Yunikorn UI applications
   ![tf-job-gpu-on-ui](../../assets/tf-job-gpu-on-ui.png)
-    Check the log of the pod:
-    ```shell script
-    kubectl logs logs po/tf-smoke-gpu-worker-0 -n kubeflow
-    ```
-    ```
-    .......
-    ..Found device 0 with properties:
-    ..name: NVIDIA GeForce RTX 3080 major: 8 minor: 6 memoryClockRate(GHz): 
1.71
-
-    .......
-    ..Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA 
GeForce RTX 3080, pci bus id: 0000:01:00.0, compute capability: 8.6)
-    .......
-    ```
-    ![tf-job-gpu-on-logs](../../assets/tf-job-gpu-on-logs.png)
\ No newline at end of file
+
+
+
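The worker log check in the list above can be scripted. This is a minimal sketch that assumes the log contains a `Creating TensorFlow device (/device:GPU:` line like the one shown in the sample output; `tf_worker_uses_gpu` is a hypothetical helper name, and other TensorFlow versions may word the line differently.

```shell
# Hypothetical helper: reads TFJob worker logs on stdin and succeeds (exit 0)
# if TensorFlow reported creating a GPU device; otherwise it fails (exit 1).
# -F treats the pattern as a fixed string, so the parenthesis is literal.
tf_worker_uses_gpu() {
  grep -q -F 'Creating TensorFlow device (/device:GPU:'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl logs tf-smoke-gpu-worker-0 -n kubeflow | tf_worker_uses_gpu \
#     && echo "worker is using a GPU"
```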
diff --git a/docs/user_guide/workloads/workload_overview.md 
b/docs/user_guide/workloads/workload_overview.md
index c8722c7fe..7040e79bd 100644
--- a/docs/user_guide/workloads/workload_overview.md
+++ b/docs/user_guide/workloads/workload_overview.md
@@ -53,6 +53,7 @@ omitted as it will be set automatically on newly created pods.
 
 Examples of more advanced use cases can be found here:
 
+* [Run NVIDIA GPU Jobs](run_nvidia)
 * [Run Spark Jobs](run_spark)
 * [Run Flink Jobs](run_flink)
 * [Run TensorFlow Jobs](run_tf)
diff --git a/sidebars.js b/sidebars.js
index 7f8eb922c..9e3ad5fbd 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -36,6 +36,7 @@ module.exports = {
                 label: 'Workloads',
                 items: [
                     'user_guide/workloads/workload_overview',
+                    'user_guide/workloads/run_nvidia',
                     'user_guide/workloads/run_spark',
                     'user_guide/workloads/run_flink',
                     'user_guide/workloads/run_tf',
