andkerber commented on issue #12204:
URL: https://github.com/apache/cloudstack/issues/12204#issuecomment-3928934474

   Just a quick update from my side. CloudStack integration with an NVIDIA H100
   GPU in vGPU mode does indeed work fine. I had failed to enable the vGPU
   profiles at the OS level and assumed that there might be some issue with the
   integration. I can't say much about using MIG mode, which is the original
   topic of this issue report. I hope this did not cause too much confusion.
   
   For anyone stumbling across this post, I'd like to leave some hints about
   enabling vGPU profiles at the OS level so that CloudStack can discover them
   successfully.
   
   # enable persistence mode
   /usr/bin/nvidia-smi -pm 1
   
   # disable mig mode 
   /usr/bin/nvidia-smi -mig 0
   
   # create the vGPU devices (needed after every reboot)
   /usr/lib/nvidia/sriov-manage -e 00000000:20:00.0
   
   # display all profiles/devices
   mdevctl types
   
   Pick a profile that suits your needs, for example this one:
   
     nvidia-1070
       Available instances: 0
       Device API: vfio-pci
       Name: NVIDIA H100L-11C
       Description: num_heads=1, frl_config=60, framebuffer=11264M, max_resolution=4096x2400, max_instance=8
   
   Let's say you want CloudStack to use 4 vGPUs with the profile shown above.
   Use "find" to get the device paths of 4 nvidia-1070 entries:
   
   # find /sys | grep mdev_supported_types.nvidia-1070.*create | sed -e 's/:/\\:/g' | head -4
   /sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:03.2/mdev_supported_types/nvidia-1070/create
   /sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:01.3/mdev_supported_types/nvidia-1070/create
   /sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:03.0/mdev_supported_types/nvidia-1070/create
   /sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:02.6/mdev_supported_types/nvidia-1070/create
   
   Now run uuidgen once per device and write its output to the corresponding
   "create" file listed above:
   
   # find /sys | grep mdev_supported_types.nvidia-1070.*create | sed -e 's/:/\\:/g' | head -4 | awk '{print "uuidgen >"$1}'
   uuidgen >/sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:03.2/mdev_supported_types/nvidia-1070/create
   uuidgen >/sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:01.3/mdev_supported_types/nvidia-1070/create
   uuidgen >/sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:03.0/mdev_supported_types/nvidia-1070/create
   uuidgen >/sys/devices/pci0000\:1f/0000\:1f\:01.0/0000\:20\:02.6/mdev_supported_types/nvidia-1070/create
   
   If the above looks ok to you, execute the 4 commands.
   After that, "mdevctl list" will show those 4 devices and CloudStack will be
   happy.
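
   The manual steps above could also be wrapped in a small shell function.
   This is only a sketch under assumptions, not an official procedure:
   create_mdevs, gen_uuid and the SYSFS_ROOT override are names made up for
   illustration, and the check of available_instances (the standard mdev sysfs
   attribute that sits next to "create") is an addition that skips parents
   which cannot take another instance of the profile.

```shell
# Sketch: create COUNT mediated devices of a given profile type.
# SYSFS_ROOT is a hypothetical override for dry runs against a fake tree;
# on a real host leave it unset so that /sys is used.

gen_uuid() {
    # Prefer uuidgen (as used in the post); fall back to the kernel's UUID source.
    if command -v uuidgen >/dev/null 2>&1; then
        uuidgen
    else
        cat /proc/sys/kernel/random/uuid
    fi
}

create_mdevs() {
    mdev_type=$1
    count=$2
    root=${SYSFS_ROOT:-/sys}

    # Every candidate parent exposes <parent>/mdev_supported_types/<type>/create.
    find "$root" -path "*/mdev_supported_types/$mdev_type/create" 2>/dev/null | sort | {
        created=0
        while read -r create; do
            [ "$created" -ge "$count" ] && break
            # Skip parents that report no free instance for this type.
            avail=$(cat "$(dirname "$create")/available_instances" 2>/dev/null || echo 0)
            [ "$avail" -gt 0 ] 2>/dev/null || continue
            uuid=$(gen_uuid)
            if echo "$uuid" > "$create"; then
                echo "created $uuid via $create"
                created=$((created + 1))
            fi
        done
        # Succeed only if the requested number of devices was created.
        [ "$created" -eq "$count" ]
    }
}
```

   On a host this would be run as root, for example "create_mdevs nvidia-1070 4",
   followed by "mdevctl list" to check the result.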
   
   If you want these devices to survive a reboot, you can "define" them and
   then set them to "auto" like this:
   
   mdevctl list | grep manual | awk '{print "mdevctl define --uuid "$1}' | sh
   mdevctl list | grep manual | awk '{print "mdevctl modify --auto --uuid "$1}' | sh
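
   The same two steps can be written as a loop instead of piping generated
   commands into sh. This is a sketch that assumes "mdevctl list" prints the
   start mode in the fourth column and "(defined)" in the fifth, as in the
   output shown in this post; other mdevctl versions may format the list
   differently. persist_mdevs is a made-up name.

```shell
# Sketch: make all "manual" mdevs persistent and auto-started.
persist_mdevs() {
    # Column 1 is the UUID, column 4 the start mode, column 5 "(defined)" if a
    # config already exists (layout assumed from the output in this post).
    mdevctl list | awk '$4 == "manual" {print $1, $5}' | while read -r uuid state; do
        # Define the device first unless a config already exists; skip it on failure.
        [ "$state" = "(defined)" ] || mdevctl define --uuid "$uuid" </dev/null || continue
        # Switch the stored config to start automatically.
        mdevctl modify --uuid "$uuid" --auto </dev/null
    done
}
```

   Calling "persist_mdevs" as root should then leave the devices listed as
   "auto (defined)".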
   
   
   In my case I created 8 devices, and the output of mdevctl list looks like
   this:
   
   # mdevctl list
   0b27fab9-8e8d-4ad7-91bd-6d7ed0b4440e 0000:20:00.7 nvidia-1070 auto (defined)
   482b292d-6b15-4370-a3a0-7fd96d8a0cc5 0000:20:01.1 nvidia-1070 auto (defined)
   b1efcc41-50f0-461f-a8cb-34ddb69f3820 0000:20:01.3 nvidia-1070 auto (defined)
   7946f615-17c5-4035-b401-73923f7f42e5 0000:20:02.4 nvidia-1070 auto (defined)
   ea734610-9edb-4db4-a299-3cfc04acd4e8 0000:20:02.6 nvidia-1070 auto (defined)
   b2bfcfef-e4e7-460f-9b0a-8f8d61df00b9 0000:20:03.0 nvidia-1070 auto (defined)
   d644316e-f2bd-4cb1-96ba-131404833e38 0000:20:03.2 nvidia-1070 auto (defined)
   e1be4558-1502-4703-a8dd-260576e0f224 0000:20:04.1 nvidia-1070 auto (defined)
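
   To check quickly that the expected number of devices is defined and set to
   auto, the list output can be counted. This is a minimal sketch;
   count_auto_mdevs is a hypothetical helper name, and it assumes the column
   layout shown above.

```shell
# Sketch: count mdevs of one type that are both "auto" and "(defined)".
# Reads "mdevctl list" output on stdin; the profile type is the first argument.
count_auto_mdevs() {
    awk -v t="$1" '$3 == t && $4 == "auto" && $5 == "(defined)" {n++} END {print n + 0}'
}
```

   Usage: "mdevctl list | count_auto_mdevs nvidia-1070". For the listing above
   this prints 8.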
   
   
   

